NOVEL HUMAN GENES AND GENE EXPRESSION PRODUCTS I

Cross-References to Related Applications

This application is a continuation-in-part of U.S. provisional patent application serial
no. 60/068,755, filed December 23, 1997, and of U.S. provisional patent application serial
no. 60/080,664, filed April 3, 1998, and of U.S. provisional patent application serial no.
60/105,234, filed October 21, 1998, each of which is incorporated herein by
reference.

Field of the Invention

The present invention relates to novel polynucleotides, particularly to novel 
polynucleotides of human origin that are expressed in a selected cell type, are differentially 
expressed in one cell type relative to another cell type (e.g., in cancerous cells, or in cells of a 
specific tissue origin) and/or share homology to polynucleotides encoding a gene product
having an identified functional domain and/or activity.

Background of the Invention 

Identification of novel polynucleotides, particularly those that encode an expressed 
gene product, is important in the advancement of drug discovery, diagnostic technologies,
and the understanding of the progression and nature of complex diseases such as cancer.
Identification of genes expressed in different cell types isolated from sources that differ in
disease state or stage, developmental stage, exposure to various environmental factors, the
tissue of origin, the species from which the tissue was isolated, and the like is key to
identifying the genetic factors that are responsible for the phenotypes associated with these
various differences.

This invention provides novel human polynucleotides, the polypeptides encoded by 
these polynucleotides, and the genes and proteins corresponding to these novel 
polynucleotides. 

Summary of the Invention

This invention relates to novel human polynucleotides and variants thereof, their 
encoded polypeptides and variants thereof, to genes corresponding to these polynucleotides
and to proteins expressed by the genes. The invention also relates to diagnostic and
therapeutic agents employing such novel human polynucleotides, their corresponding genes
or gene products, e.g., these genes and proteins, including probes, antisense constructs, and
antibodies.

Accordingly, in one embodiment, the present invention features a library of
polynucleotides, the library comprising the sequence information of at least one of SEQ ID
NOS: 1-844. In related aspects, the invention features a library provided on a nucleic acid
array, or in a computer-readable format.

In one embodiment, the library comprises a differentially expressed polynucleotide
comprising a sequence selected from the group consisting of SEQ ID NOS: 9, 39, 42, 52, 62,
74, 119, 172, 317, and 379. In specific related embodiments, the library comprises: 1) a
polynucleotide that is differentially expressed in a human breast cancer cell, where the
polynucleotide comprises a sequence selected from the group consisting of SEQ ID NOS: 4,
9, 39, 42, 52, 62, 65, 66, 68, 74, 81, 114, 123, 144, 130, 157, 162, 172, 178, 183, 202, 214,
219, 223, 258, 298, 317, 338, 379, 384, 386, and 388; 2) a polynucleotide differentially
expressed in a human colon cancer cell, where the polynucleotide comprises a sequence
selected from the group consisting of SEQ ID NOS: 1, 39, 52, 97, 119, 134, 172, 176, 241,
288, 317, 357, 362, and 374; or 3) a polynucleotide differentially expressed in a human lung
cancer cell, where the polynucleotide comprises a sequence selected from the group
consisting of SEQ ID NOS: 9, 34, 42, 62, 74, 106, 119, 135, 154, 160, 260, 308, 323, 349,
361, 369, 371, 379, 395, 381, and 400.

In another aspect, the invention features an isolated polynucleotide
comprising a nucleotide sequence having at least 90% sequence identity to an identifying
sequence of SEQ ID NOS: 1-844 or a degenerate variant thereof. In related aspects, the
invention features recombinant host cells and vectors comprising the polynucleotides of the
invention, as well as isolated polypeptides encoded by the polynucleotides of the invention
and antibodies that specifically bind such polypeptides.

In one embodiment, the invention features an isolated polynucleotide comprising a
sequence encoding a polypeptide of a protein family selected from the group consisting of:
4 transmembrane segments integral membrane proteins, 7 transmembrane receptors,
ATPases associated with various cellular activities (AAA), eukaryotic aspartyl proteases,
[illegible] family of transcription factors, G-protein alpha subunit, phorbol esters/diacylglycerol
binding proteins, protein kinase, protein phosphatase 2C, protein tyrosine phosphatase,
trypsin, wnt family of developmental signaling proteins, and WW/rsp5/WWP domain
containing proteins. In a specific related embodiment, the invention features a
polynucleotide comprising a sequence of one of SEQ ID NOS: 24, 41, 101, 157, 291, 305,
315, 341, 63, 116, 134, 136, 151, 384, 404, 308, 213, 367, 188, 251, 202, 315, 367, 397, 256,
382, 169, 23, 291, 324, 330, 341, 353, 188, 379, and 395.

In another embodiment, the invention features a polynucleotide comprising a
sequence encoding a polypeptide having a functional domain selected from the group
consisting of: Ank repeat, basic region plus leucine zipper transcription factors,
bromodomain, EF-hand, SH3 domain, WD domain/G-beta repeats, zinc finger (C2H2 type),
zinc finger (CCHC class), and zinc-binding metalloprotease domain. In a specific related
embodiment, the invention features a polynucleotide comprising a sequence of one of SEQ
ID NOS: 116, 251, 374, 97, 136, 242, 379, 306, 386, 18, 335, 61, 306, 386, 322, 306, and
395.

In another aspect, the invention features a method of detecting differentially
expressed genes correlated with a cancerous state of a mammalian cell, where the method
comprises the step of detecting at least one differentially expressed gene product in a test
sample derived from a cell suspected of being cancerous, where the gene product is encoded
by a gene corresponding to a sequence of at least one of SEQ ID NOS: 4, 9, 39, 42, 52, 62,
65, 66, 68, 74, 81, 114, 123, 144, 130, 157, 162, 172, 178, 183, 202, 214, 219, 223, 258, 298,
317, 338, 379, 384, 386, 388, 1, 39, 52, 97, 119, 134, 172, 176, 241, 288, 317, 357, 362, 374,
9, 34, 42, 62, 74, 106, 119, 135, 154, 160, 260, 308, 323, 349, 361, 369, 371, 379, 395, 381,
and 400. Detection of the differentially expressed gene product is correlated with a
cancerous state of the cell from which the test sample was derived. In one embodiment, the
detecting is by hybridization of the test sample to a reference array, wherein the reference
array comprises an identifying sequence of at least one of SEQ ID NOS: 1-844.

In one embodiment of the method of the invention, the cell is a breast tissue derived
cell, and the differentially expressed gene product is encoded by a gene corresponding to a
sequence of at least one of SEQ ID NOS: 4, 9, 39, 42, 52, 62, 65, 66, 68, 74, 81, 114, 123,
144, 130, 157, 162, 172, 178, 183, 202, 214, 219, 223, 258, 298, 317, 338, 379, 384, 386,
and 388.

In another embodiment of the method of the invention, the cell is a colon tissue
derived cell, and the differentially expressed gene product is encoded by a gene corresponding to
a sequence of at least one of SEQ ID NOS: 1, 39, 52, 97, 119, 134, 172, 176, 241, 288, 317,
357, 362, and 374.

In yet another embodiment of the method of the invention, the cell is a lung tissue
derived cell, and the differentially expressed gene product is encoded by a gene corresponding to
a sequence of at least one of SEQ ID NOS: 9, 34, 42, 62, 74, 106, 119, 135, 154, 160, 260,
308, 323, 349, 361, 369, 371, 379, 395, 381, and 400.

Other aspects and embodiments of the invention will be readily apparent to the 
ordinarily skilled artisan upon reading the description provided herein. 

Detailed Description of the Invention 

The invention relates to polynucleotides comprising the disclosed nucleotide
sequences, to full length cDNA, mRNA and genes corresponding to these sequences, and to
polypeptides and proteins encoded by these polynucleotides and genes.

Also included are polynucleotides that encode polypeptides and proteins encoded by
the polynucleotides of the Sequence Listing. The various polynucleotides that can encode
these polypeptides and proteins differ because of the degeneracy of the genetic code, in that
most amino acids are encoded by more than one triplet codon. The identity of such codons
is well known in the art, and this information can be used for the construction of the
polynucleotides within the scope of the invention.
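
The degeneracy described above can be illustrated with a short Python sketch. It is offered only as an illustration and is not part of the disclosed methods: it builds the standard genetic code from the conventional TCAG codon ordering and counts how many distinct coding sequences could encode a given peptide. The names `CODON_TABLE`, `codons_for`, and `count_encodings` are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code; amino acids are listed in TCAG codon order
# (TTT, TTC, TTA, TTG, TCT, ...). '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codons_for(amino_acid):
    """Return every DNA codon that encodes the one-letter amino acid."""
    return sorted(c for c, aa in CODON_TABLE.items() if aa == amino_acid)


def count_encodings(peptide):
    """Count the distinct coding sequences that encode the peptide."""
    total = 1
    for aa in peptide:
        total *= len(codons_for(aa))
    return total


if __name__ == "__main__":
    print(codons_for("L"))          # the six leucine codons
    print(count_encodings("MLW"))   # 1 * 6 * 1 possible coding sequences
```

Methionine and tryptophan each have a single codon, whereas leucine, serine, and arginine each have six, so the number of polynucleotides encoding the same polypeptide grows multiplicatively with length.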

Polynucleotides encoding polypeptides and proteins that are variants of the
polypeptides and proteins encoded by the polynucleotides and related cDNA and genes are
also within the scope of the invention. The variants differ from wild type protein in having
one or more amino acid substitutions that either enhance, add, or diminish a biological
activity of the wild type protein. Once the amino acid change is selected, a polynucleotide
encoding that variant is constructed according to the invention.

The following detailed description describes the polynucleotide compositions
encompassed by the invention, methods for obtaining cDNA or genomic DNA encoding a
full-length gene product, expression of these polynucleotides and genes, identification of
structural motifs of the polynucleotides and genes, identification of the function of a gene
product encoded by a gene corresponding to a polynucleotide of the invention, use of the
provided polynucleotides as probes and in mapping and in tissue profiling, use of the
corresponding polypeptides and other gene products to raise antibodies, and use of the
polynucleotides and their encoded gene products for therapeutic and diagnostic purposes.

I. Polynucleotide Compositions

The scope of the invention with respect to polynucleotide compositions includes, but is not
necessarily limited to, polynucleotides having a sequence set forth in any one of SEQ ID
NOS: 1-844; polynucleotides obtained from the biological materials described herein or other
biological sources (particularly human sources) by hybridization under stringent conditions
(particularly conditions of high stringency); genes corresponding to the provided
polynucleotides; variants of the provided polynucleotides and their corresponding genes,
particularly those variants that retain a biological activity of the encoded gene product (e.g.,
a biological activity ascribed to a gene product corresponding to the provided
polynucleotides as a result of the assignment of the gene product to a protein family(ies)
and/or identification of a functional domain present in the gene product). Other nucleic acid
compositions contemplated by and within the scope of the present invention will be readily
apparent to one of ordinary skill in the art when provided with the disclosure here.

The invention features polynucleotides that are expressed in cells of human tissue,
specifically human colon, breast, and/or lung tissue. Novel nucleic acid compositions of the
invention of particular interest comprise a sequence set forth in any one of SEQ ID NOS: 1-844
or an identifying sequence thereof. An "identifying sequence" is a contiguous sequence
of residues at least about 10 nt to about 20 nt in length, usually at least about 50 nt to about
100 nt in length, that uniquely identifies a polynucleotide sequence, e.g., exhibits less than
90%, usually less than about 80% to about 85% sequence identity to any contiguous
nucleotide sequence of more than about 20 nt. Thus, the subject novel nucleic acid
compositions include full length cDNAs or mRNAs that encompass an identifying sequence
of contiguous nucleotides from any one of SEQ ID NOS: 1-844.

The polynucleotides of the invention also include polynucleotides having sequence
similarity or sequence identity. Nucleic acids having sequence similarity are detected by
hybridization under low stringency conditions, for example, at 50°C and 10XSSC (0.9 M
saline/0.09 M sodium citrate) and remain bound when subjected to washing at 55°C in
1XSSC. Sequence identity can be determined by hybridization under stringent conditions,
for example, at 50°C or higher and 0.1XSSC (9 mM saline/0.9 mM sodium citrate).
Hybridization methods and conditions are well known in the art, see, e.g., U.S. Patent No.
5,707,829. Nucleic acids that are substantially identical to the provided polynucleotide
sequences, e.g., allelic variants, genetically altered versions of the gene, etc., bind to the
provided polynucleotide sequences (SEQ ID NOS: 1-844) under stringent hybridization
conditions. By using probes, particularly labeled probes of DNA sequences, one can isolate
homologous or related genes. The source of homologous genes can be any species, e.g.,
primate species, particularly human; rodents, such as rats and mice; canines, felines, bovines,
ovines, equines, yeast, nematodes, etc.

Preferably, hybridization is performed using at least 15 contiguous nucleotides of at
least one of SEQ ID NOS: 1-844. That is, when at least 15 contiguous nucleotides of one of
the disclosed SEQ ID NOS is used as a probe, the probe will preferentially hybridize with a
gene or mRNA (of the biological material) comprising the complementary sequence,
allowing the identification and retrieval of the nucleic acids of the biological material that
uniquely hybridize to the selected probe. Probes from more than one SEQ ID NO will
hybridize with the same gene or mRNA if the cDNA from which they were derived
corresponds to one mRNA. Probes of more than 15 nucleotides can be used, but 15
nucleotides represents enough sequence for unique identification.
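
As a simplified computational analogue of the probe hybridization described above, the following Python sketch locates the sites on a target strand that are perfectly complementary to a probe. It is an illustration only: real hybridization tolerates mismatches and depends on stringency, and the function names and example sequences here are hypothetical.

```python
# Translation table mapping each DNA base to its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def probe_sites(probe, target):
    """Return 0-based start positions on `target` where the probe would
    pair perfectly, i.e. where `target` contains the probe's reverse
    complement. Overlapping sites are reported."""
    site = reverse_complement(probe)
    target = target.upper()
    positions = []
    start = target.find(site)
    while start != -1:
        positions.append(start)
        start = target.find(site, start + 1)
    return positions


if __name__ == "__main__":
    target = "TTGACCATGGCTAGCTAGGATCCAAGT"          # made-up target strand
    probe = reverse_complement(target[5:20])         # a 15-nt probe
    print(probe, probe_sites(probe, target))
```

A probe that yields exactly one site in the material being searched corresponds to the "unique identification" case discussed above.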

The polynucleotides of the invention also include naturally occurring variants of the
nucleotide sequences (e.g., degenerate variants, allelic variants, etc.). Variants of the
polynucleotides of the invention are identified by hybridization of putative variants with
nucleotide sequences disclosed herein, preferably by hybridization under stringent conditions.
For example, by using appropriate wash conditions, variants of the polynucleotides of the
invention can be identified where the allelic variant exhibits at most about 25-30% base pair
mismatches relative to the selected polynucleotide probe. In general, allelic variants contain
15-25% base pair mismatches, and can contain as little as even 5-15%, or 2-5%, or 1-2%
base pair mismatches, as well as a single base-pair mismatch.
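
The mismatch percentages above can be made concrete with a small Python sketch. It assumes two equal-length sequences that are already aligned without gaps, which is a simplification; the function names and the default 25% limit are illustrative choices, not values fixed by the disclosure.

```python
def percent_mismatch(probe, candidate):
    """Percentage of mismatched positions between two equal-length,
    ungapped sequences."""
    if len(probe) != len(candidate):
        raise ValueError("sequences must be the same length")
    if not probe:
        raise ValueError("sequences must be non-empty")
    mismatches = sum(
        1 for x, y in zip(probe.upper(), candidate.upper()) if x != y
    )
    return 100.0 * mismatches / len(probe)


def within_mismatch_limit(probe, candidate, limit_percent=25.0):
    """True if the candidate differs from the probe by no more than
    `limit_percent` percent of positions."""
    return percent_mismatch(probe, candidate) <= limit_percent


if __name__ == "__main__":
    # One mismatch in 20 positions.
    print(percent_mismatch("ACGTACGTACGTACGTACGT", "ACGTACGTACGTACGTACGA"))
```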

The invention also encompasses homologs corresponding to the polynucleotides of
SEQ ID NOS: 1-844, where the source of homologous genes can be any mammalian species,
e.g., primate species, particularly human; rodents, such as rats; canines, felines, bovines,
ovines, equines, yeast, nematodes, etc. Between mammalian species, e.g., human and
mouse, homologs have substantial sequence similarity, e.g., at least 75% sequence identity,
usually at least 90%, more usually at least 95% between nucleotide sequences. Sequence
similarity is calculated based on a reference sequence, which may be a subset of a larger
sequence, such as a conserved motif, coding region, flanking region, etc. A reference
sequence will usually be at least about 18 contiguous nt long, more usually at least about 30
nt long, and may extend to the complete sequence that is being compared. Algorithms for
sequence analysis are known in the art, such as BLAST, described in Altschul et al., J. Mol.
Biol. (1990) 215:403-10.

In general, variants of the invention have a sequence identity of at least
about 65%, preferably at least about 75%, more preferably at least about 85%, and can be
at least about 90% or more, as determined by the Smith-Waterman homology
search algorithm as implemented in MPSRCH program (Oxford Molecular). For the
purposes of this invention, a preferred method of calculating percent identity is the
Smith-Waterman algorithm, using the following parameters. Global DNA sequence identity must be greater
than 65% as determined by the Smith-Waterman homology search algorithm as implemented
in MPSRCH program (Oxford Molecular) using an affine gap search with the following
search parameters: gap open penalty, 12; and gap extension penalty, 1.
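
For readers unfamiliar with affine-gap Smith-Waterman alignment, the following Python sketch shows the general technique with the gap open penalty of 12 and gap extension penalty of 1 stated above. It is not the MPSRCH implementation. The match and mismatch scores (+5 and -4), the convention that a gap of length k costs 12 + (k - 1) x 1, and the choice to compute percent identity over alignment columns are all assumptions made for illustration.

```python
def smith_waterman_affine(a, b, match=5, mismatch=-4,
                          gap_open=12, gap_extend=1):
    """Local alignment with affine gaps (Gotoh's three-matrix scheme).
    Returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    neg = float("-inf")
    H = [[0] * (m + 1) for _ in range(n + 1)]    # best score ending at (i, j)
    E = [[neg] * (m + 1) for _ in range(n + 1)]  # ends with a gap in a
    F = [[neg] * (m + 1) for _ in range(n + 1)]  # ends with a gap in b
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            E[i][j] = max(H[i][j - 1] - gap_open, E[i][j - 1] - gap_extend)
            F[i][j] = max(H[i - 1][j] - gap_open, F[i - 1][j] - gap_extend)
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s, E[i][j], F[i][j])
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the local score drops to zero.
    out_a, out_b = [], []
    i, j = best_pos
    state = "H"
    while i > 0 and j > 0:
        if state == "H":
            if H[i][j] == 0:
                break
            s = match if a[i - 1] == b[j - 1] else mismatch
            if H[i][j] == H[i - 1][j - 1] + s:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
            elif H[i][j] == E[i][j]:
                state = "E"
            else:
                state = "F"
        elif state == "E":                        # gap in a, consume b
            out_a.append("-")
            out_b.append(b[j - 1])
            if E[i][j] == H[i][j - 1] - gap_open:
                state = "H"
            j -= 1
        else:                                     # gap in b, consume a
            out_a.append(a[i - 1])
            out_b.append("-")
            if F[i][j] == H[i - 1][j] - gap_open:
                state = "H"
            i -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(aligned_a, aligned_b):
    """Identical columns as a percentage of all alignment columns."""
    if not aligned_a:
        return 0.0
    matches = sum(x == y and x != "-" for x, y in zip(aligned_a, aligned_b))
    return 100.0 * matches / len(aligned_a)
```

Under these assumed scores, a candidate would satisfy the 65% threshold when `percent_identity` of its alignment to a reference exceeds 65. Other identity conventions (for example, dividing by the length of the shorter sequence) would give different values.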

The subject nucleic acids can be cDNAs or genomic DNAs, as well as fragments
thereof, particularly fragments that encode a biologically active gene product and/or are
useful in the methods disclosed herein (e.g., in diagnosis, as a unique identifier of a
differentially expressed gene of interest, etc.). The term "cDNA" as used herein is intended
to include all nucleic acids that share the arrangement of sequence elements found in native
mature mRNA species, where sequence elements are exons and 3' and 5' non-coding
regions. Normally mRNA species have contiguous exons, with the intervening introns,
when present, being removed by nuclear RNA splicing, to create a continuous open reading
frame encoding a polypeptide of the invention.

A genomic sequence of interest comprises the nucleic acid present between the
initiation codon and the stop codon, as defined in the listed sequences, including all of the
introns that are normally present in a native chromosome. It can further include the 3' and 5'
untranslated regions found in the mature mRNA. It can further include specific
transcriptional and translational regulatory sequences, such as promoters, enhancers, etc.,
including about 1 kb, but possibly more, of flanking genomic DNA at either the 5' or 3' end
of the transcribed region. The genomic DNA can be isolated as a fragment of 100 kbp or
smaller, and substantially free of flanking chromosomal sequence. The genomic DNA
flanking the coding region, either 3' or 5', or internal regulatory sequences as sometimes
found in introns, contains sequences required for proper tissue, stage-specific, or
disease-state specific expression.

The nucleic acid compositions of the subject invention can encode all or a part of the
subject differentially expressed polypeptides. Double or single stranded fragments can be
obtained from the DNA sequence by chemically synthesizing oligonucleotides in accordance
with conventional methods, by restriction enzyme digestion, by PCR amplification, etc.
Isolated polynucleotides and polynucleotide fragments of the invention comprise at least
about 10, about 15, about 20, about 35, about 50, about 100, about 150 to about 200, about
250 to about 300, or about 350 contiguous nucleotides selected from the polynucleotide
sequences as shown in SEQ ID NOS: 1-844. For the most part, fragments will be of at least
15 nt, usually at least 18 nt or 25 nt, and up to at least about 50 contiguous nt in length or
more. In a preferred embodiment, the polynucleotide molecules comprise a contiguous
sequence of at least twelve nucleotides selected from the group consisting of the
polynucleotides shown in SEQ ID NOS: 1-844.

Probes specific to the polynucleotides of the invention can be generated using the
polynucleotide sequences disclosed in SEQ ID NOS: 1-844. The probes are preferably at
least about 12, 15, 16, 18, 20, 22, 24, or 25 nucleotide fragments of a corresponding
contiguous sequence of SEQ ID NOS: 1-844, and can be less than 2, 1, 0.5, 0.1, or 0.05 kb in
length. The probes can be synthesized chemically or can be generated from longer
polynucleotides using restriction enzymes. The probes can be labeled, for example, with a
radioactive, biotinylated, or fluorescent tag. Preferably, probes are designed based upon an
identifying sequence of a polynucleotide of one of SEQ ID NOS: 1-844. More preferably,
probes are designed based on a contiguous sequence of one of the subject polynucleotides
that remains unmasked following application of a masking program for masking low
complexity (e.g., XBLAST) to the sequence, i.e., one would select an unmasked region, as
indicated by the polynucleotides outside the poly-n stretches of the masked sequence
produced by the masking program.
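
Selecting unmasked regions for probe design can be sketched in Python as follows. The sketch assumes the masking program has replaced low-complexity residues with the letter N (the "poly-n stretches" mentioned above); it does not call XBLAST itself, and the function name and the 12-nt default minimum are illustrative choices.

```python
import re


def unmasked_regions(masked_seq, min_length=12):
    """Return (start, end, subsequence) for each run of the masked
    sequence that contains no 'N' and is at least `min_length` long.
    Coordinates are 0-based and `end` is exclusive."""
    regions = []
    for m in re.finditer(r"[^Nn]+", masked_seq):
        if m.end() - m.start() >= min_length:
            regions.append((m.start(), m.end(), m.group()))
    return regions


if __name__ == "__main__":
    # Made-up masked sequence: two usable regions and one short fragment.
    masked = ("ACGTACGTACGTAAGG" + "N" * 10 + "ACG" + "NNNN"
              + "TTGACCATGGCTAGC")
    for start, end, sub in unmasked_regions(masked):
        print(start, end, sub)
```

Candidate probes would then be drawn from within the returned regions rather than from the masked stretches.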

The polynucleotides of the subject invention are isolated and obtained in substantial
purity, generally as other than an intact chromosome. Usually, the polynucleotides, either as
DNA or RNA, will be obtained substantially free of other naturally-occurring nucleic acid
sequences, generally being at least about 50%, usually at least about 90% pure and are
typically "recombinant", e.g., flanked by one or more nucleotides with which it is not
normally associated on a naturally occurring chromosome.

The polynucleotides of the invention can be provided as a linear molecule or within a
circular molecule. They can be provided within autonomously replicating molecules
(vectors) or within molecules without replication sequences. They can be regulated by their
own or by other regulatory sequences, as is known in the art. The polynucleotides of the
invention can be introduced into suitable host cells using a variety of techniques which are
available in the art, such as transferrin polycation-mediated DNA transfer, transfection with
naked or encapsulated nucleic acids, liposome-mediated DNA transfer, intracellular
transportation of DNA-coated latex beads, protoplast fusion, viral infection, electroporation,
gene gun, calcium phosphate-mediated transfection, and the like.

The subject nucleic acid compositions can be used to, for example, produce
polypeptides, as probes for the detection of mRNA of the invention in biological samples
(e.g., extracts of human cells), to generate additional copies of the polynucleotides, to
generate ribozymes or antisense oligonucleotides, and as single stranded DNA probes or as
triple-strand forming oligonucleotides. The probes described herein can be used to, for
example, determine the presence or absence of the polynucleotide sequences as shown in
SEQ ID NOS: 1-844 or variants thereof in a sample. These and other uses are described in
more detail below.

Use of Polynucleotides to Obtain Full-Length cDNA and Full-Length Human Gene and
Promoter Region

Full-length cDNA molecules comprising the disclosed polynucleotides are obtained
as follows. A polynucleotide having a sequence of one of SEQ ID NOS: 1-844, or a portion
thereof comprising at least 12, 15, 18, or 20 nucleotides, is used as a hybridization probe to
detect hybridizing members of a cDNA library using probe design methods, cloning
methods, and clone selection techniques such as those described in U.S. Patent No.
5,654,173. Libraries of cDNA are made from selected tissues, such as normal or tumor
tissue, or from tissues of a mammal treated with, for example, a pharmaceutical agent.
Preferably, the tissue is the same as the tissue from which the polynucleotides of the
invention were isolated, as both the polynucleotides described herein and the cDNA
represent expressed genes. Most preferably, the cDNA library is made from the biological
material described herein in the Examples. Alternatively, many cDNA libraries are available
commercially. (Sambrook et al., Molecular Cloning: A Laboratory Manual, 2nd Ed., (1989)
Cold Spring Harbor Press, Cold Spring Harbor, NY). The choice of cell type for library
construction can be made after the identity of the protein encoded by the gene corresponding
to the polynucleotide of the invention is known. This will indicate which tissue and cell
types are likely to express the related gene, and thus represent a suitable source for the
mRNA for generating the cDNA. Where the provided polynucleotides are isolated from
cDNA libraries, the libraries are prepared from mRNA of human colon cells, more
preferably, human colon cancer cells, even more preferably, from a highly metastatic colon
cell, Km12L4-A.

Techniques for producing and probing nucleic acid sequence libraries are described,
for example, in Sambrook et al., Molecular Cloning: A Laboratory Manual, 2nd Ed., (1989)
Cold Spring Harbor Press, Cold Spring Harbor, NY. The cDNA can be prepared by using
primers based on sequence from SEQ ID NOS: 1-844. In one embodiment, the cDNA library
can be made from only poly-adenylated mRNA. Thus, poly-T primers can be used to
prepare cDNA from the mRNA.

Members of the library that are larger than the provided polynucleotides, and
preferably that encompass the complete coding sequence of the native message, are obtained.
In order to confirm that the entire cDNA has been obtained, RNA protection experiments
are performed as follows. Hybridization of a full-length cDNA to an mRNA will protect the
RNA from RNase degradation. If the cDNA is not full length, then the portions of the
mRNA that are not hybridized will be subject to RNase degradation. This is assayed, as is
known in the art, by changes in electrophoretic mobility on polyacrylamide gels, or by
detection of released monoribonucleotides. Sambrook et al., Molecular Cloning: A
Laboratory Manual, 2nd Ed., (1989) Cold Spring Harbor Press, Cold Spring Harbor, NY. In
order to obtain additional sequences 5' to the end of a partial cDNA, 5' RACE (PCR
Protocols: A Guide to Methods and Applications, (1990) Academic Press, Inc.) is performed.

Genomic DNA is isolated using the provided polynucleotides in a manner similar to
the isolation of full-length cDNAs. Briefly, the provided polynucleotides, or portions
thereof, are used as probes to libraries of genomic DNA. Preferably, the library is obtained
from the cell type that was used to generate the polynucleotides of the invention, but this is
not essential. Most preferably, the genomic DNA is obtained from the biological material
described herein in the Examples. Such libraries can be in vectors suitable for carrying large
segments of a genome, such as P1 or YAC, as described in detail in Sambrook et al.,
9.4-9.30. In addition, genomic sequences can be isolated from human BAC libraries, which are
commercially available from Research Genetics, Inc., Huntsville, Alabama, USA, for
example. In order to obtain additional 5' or 3' sequences, chromosome walking is performed,
as described in Sambrook et al., such that adjacent and overlapping fragments of genomic
DNA are isolated. These are mapped and pieced together, as is known in the art, using
restriction digestion enzymes and DNA ligase.

Using the polynucleotide sequences of the invention, corresponding full-length genes
can be isolated using both classical and PCR methods to construct and probe cDNA libraries.
Using either method, Northern blots, preferably, are performed on a number of cell types to
determine which cell lines express the gene of interest at the highest level. Classical
methods of constructing cDNA libraries are taught in Sambrook et al., supra. With these
methods, cDNA can be produced from mRNA and inserted into viral or expression vectors.
Typically, libraries of mRNA comprising poly(A) tails can be produced with poly(T)
primers. Similarly, cDNA libraries can be produced using the instant sequences as primers.

PCR methods are used to amplify the members of a cDNA library that comprise the
desired insert. In this case, the desired insert will contain sequence from the full length
cDNA that corresponds to the instant polynucleotides. Such PCR methods include gene
trapping and RACE methods. Gene trapping entails inserting a member of a cDNA library
into a vector. The vector then is denatured to produce single stranded molecules. Next, a
substrate-bound probe, such as a biotinylated oligo, is used to trap cDNA inserts of interest.
Biotinylated probes can be linked to an avidin-bound solid substrate. PCR methods can be
used to amplify the trapped cDNA. To trap sequences corresponding to the full length
genes, the labeled probe sequence is based on the polynucleotide sequences of the invention.
Random primers or primers specific to the library vector can be used to amplify the trapped
cDNA. Such gene trapping techniques are described in Gruber et al., WO 95/04745 and
Gruber et al., U.S. Pat. No. 5,500,356. Kits are commercially available to perform gene
trapping experiments from, for example, Life Technologies, Gaithersburg, Maryland, USA.

"Rapid amplification of cDNA ends," or RACE, is a PCR method of amplifying
cDNAs from a number of different RNAs. The cDNAs are ligated to an oligonucleotide
linker, and amplified by PCR using two primers. One primer is based on sequence from the
instant polynucleotides, for which full length sequence is desired, and a second primer
comprises sequence that hybridizes to the oligonucleotide linker to amplify the cDNA. A
description of this method is reported in WO 97/19110. In preferred embodiments of
RACE, a common primer is designed to anneal to an arbitrary adaptor sequence ligated to
cDNA ends (Apte and Siebert, Biotechniques (1993) 15:890-893; Edwards et al., Nuc. Acids
Res. (1991) 19:5227-5232). When a single gene-specific RACE primer is paired with the
common primer, preferential amplification of sequences between the single gene specific
primer and the common primer occurs. Commercial cDNA pools modified for use in RACE
are available.

Another PCR-based method generates a full-length cDNA library with anchored ends
without needing specific knowledge of the cDNA sequence. The method uses lock-docking
primers (I-VI), where one primer, polyTV (I-III), locks over the polyA tail of eukaryotic
mRNA producing first strand synthesis and a second primer, polyGH (IV-VI), locks onto the
polyC tail added by terminal deoxynucleotidyl transferase (TdT). This method is described
in WO 96/40998.

The promoter region of a gene generally is located 5' to the initiation site for RNA
polymerase II. Hundreds of promoter regions contain the "TATA" box, a sequence such as
TATTA or TATAA, which is sensitive to mutations. The promoter region can be obtained
by performing 5' RACE using a primer from the coding region of the gene. Alternatively,
the cDNA can be used as a probe for the genomic sequence, and the region 5' to the coding
region is identified by "walking up." If the gene is highly expressed or differentially
expressed, the promoter from the gene can be of use in a regulatory construct for a
heterologous gene.
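
Once sequence 5' to the coding region has been obtained, candidate TATA boxes can be located computationally. The Python sketch below scans an upstream sequence for a small set of TATA-like motifs; the motif list (TATAA and TATTA), the function name, and the example sequence are assumptions for illustration, and a motif hit alone does not establish promoter activity.

```python
import re

# Assumed TATA-like motifs; extend the tuple to search for other variants.
TATA_MOTIFS = ("TATAA", "TATTA")

# A lookahead is used so that overlapping motif occurrences are all found.
_TATA_PATTERN = re.compile("(?=(" + "|".join(TATA_MOTIFS) + "))")


def find_tata_boxes(upstream):
    """Return (position, motif) pairs for each TATA-like motif found in
    the upstream sequence. Positions are 0-based."""
    return [
        (m.start(), m.group(1))
        for m in _TATA_PATTERN.finditer(upstream.upper())
    ]


if __name__ == "__main__":
    upstream = "GGCCTATAAAGGCGCTATTAGG"   # made-up upstream region
    print(find_tata_boxes(upstream))
```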

Once the full-length cDNA or gene is obtained, DNA encoding variants can be
prepared by site-directed mutagenesis, described in detail in Sambrook et al., 15.3-15.63.
The choice of codon or nucleotide to be replaced can be based on disclosure herein on
optional changes in amino acids to achieve altered protein structure and/or function.

As an alternative method to obtaining DNA or RNA from a biological material, 
nucleic acid comprising nucleotides having the sequence of one or more polynucleotides of 
the invention can be synthesized. Thus, the invention encompasses nucleic acid molecules 
ranging in length from 15 nucleotides (corresponding to at least 15 contiguous nucleotides of 
one of SEQ ID NOS: 1-844) up to a maximum length suitable for one or more biological 
manipulations, including replication and expression, of the nucleic acid molecule. The 
invention includes but is not limited to (a) nucleic acid having the size of a full gene, and 
comprising at least one of SEQ ID NOS: 1-844; (b) the nucleic acid of (a) also comprising at 
least one additional gene, operably linked to permit expression of a fusion protein; (c) an 
expression vector comprising (a) or (b); (d) a plasmid comprising (a) or (b); and (e) a
recombinant viral particle comprising (a) or (b). Once provided with the polynucleotides
disclosed herein, construction or preparation of (a) - (e) is well within the skill in the art.

The sequence of a nucleic acid comprising at least 15 contiguous nucleotides of at
least any one of SEQ ID NOS: 1-844, preferably the entire sequence of at least any one of 
SEQ ID NOS: 1-844, is not limited and can be any sequence of A, T, G, and/or C (for DNA) 
and A, U, G, and/or C (for RNA) or modified bases thereof, including inosine and 
pseudouridine. The choice of sequence will depend on the desired function and can be 
dictated by coding regions desired, the intron-like regions desired, and the regulatory regions 
desired. Where the entire sequence of any one of SEQ ID NOS: 1-844 is within the nucleic 
acid, the nucleic acid obtained is referred to herein as a polynucleotide comprising the
sequence of any one of SEQ ID NOS: 1-844.

II. Expression of Polypeptide Encoded by Full-Length cDNA or Full-Length Gene

The provided polynucleotide (e.g., a polynucleotide having a sequence of one of SEQ
ID NOS: 1-844), the corresponding cDNA, or the full-length gene is used to express a partial
or complete gene product.

Constructs of polynucleotides having sequences of SEQ ID NOS: 1-844 can be
generated synthetically. Alternatively, single-step assembly of a gene and entire plasmid
from large numbers of oligodeoxyribonucleotides is described by, e.g., Stemmer et al., Gene
(Amsterdam) (1995) 164(1):49-53. This method uses assembly PCR (the synthesis of long
DNA sequences from large numbers of oligodeoxyribonucleotides (oligos)).
The method is derived from DNA shuffling (Stemmer, Nature (1994) 370:389-391), and
does not rely on DNA ligase, but instead relies on DNA polymerase to build increasingly
longer DNA fragments during the assembly process. For example, a 1.1-kb fragment
containing the TEM-1 beta-lactamase-encoding gene (bla) can be assembled in a single
reaction from a total of 56 oligos, each 40 nucleotides (nt) in length. The synthetic gene can
be PCR amplified and cloned in a vector containing the tetracycline-resistance gene (Tc-R)
as the sole selectable marker. Without relying on ampicillin (Ap) selection, 76% of the Tc-R
colonies were Ap-R, making this approach a general method for the rapid and cost-effective
synthesis of any gene.
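
The oligo layout used in assembly PCR can be illustrated with a short sketch. The code below is not taken from Stemmer et al.; it assumes, for illustration only, that the target is tiled with 40-nt oligos offset by 20 nt on alternating strands, so that each oligo overlaps its neighbors by 20 nt. The function names tile_oligos and reverse_complement are hypothetical.

```python
# Illustrative sketch only: tile a target sequence with overlapping oligos
# on alternating strands (assumed layout, not the published protocol).

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def tile_oligos(target, oligo_len=40, overlap=20):
    """Split target into oligos of oligo_len nt, each offset by
    (oligo_len - overlap) nt; odd-numbered oligos are reverse-complemented
    so that neighbors can anneal to one another.

    Any tail shorter than a full oligo is left uncovered in this sketch.
    """
    if not 0 < overlap < oligo_len:
        raise ValueError("overlap must be between 0 and oligo_len")
    step = oligo_len - overlap
    oligos = []
    for i, start in enumerate(range(0, len(target) - oligo_len + 1, step)):
        fragment = target[start:start + oligo_len]
        oligos.append(fragment if i % 2 == 0 else reverse_complement(fragment))
    return oligos


if __name__ == "__main__":
    # A made-up 1140-nt target; under this layout it needs 56 oligos of 40 nt.
    target = "ACGT" * 285
    oligos = tile_oligos(target)
    print(len(oligos), "oligos of", len(oligos[0]), "nt")
```

Under these assumptions, 56 oligos of 40 nt span 1140 nt, which is in line with the 1.1-kb fragment mentioned above.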

Appropriate polynucleotide constructs are purified using standard recombinant DNA
techniques as described in, for example, Sambrook et al., Molecular Cloning: A Laboratory
Manual, 2nd Ed., (1989) Cold Spring Harbor Press, Cold Spring Harbor, NY, and under
current regulations described in United States Dept. of HHS, National Institute of Health
(NIH) Guidelines for Recombinant DNA Research. The gene product encoded by a
polynucleotide of the invention is expressed in any expression system, including, for
example, bacterial, yeast, insect, amphibian and mammalian systems. Suitable vectors and
host cells are described in U.S. Patent No. 5,654,173.

Bacteria. Expression systems in bacteria include those described in Chang et al.,
Nature (1978) 275:615; Goeddel et al., Nature (1979) 281:544; Goeddel et al., Nucleic Acids
Res. (1980) 8:4057; EP 0 036,776; U.S. Patent No. 4,551,433; DeBoer et al., Proc. Natl.
Acad. Sci. (USA) (1983) 80:21-25; and Siebenlist et al., Cell (1980) 20:269.

Yeast. Expression systems in yeast include those described in Hinnen et al., Proc.
Natl. Acad. Sci. (USA) (1978) 75:1929; Ito et al., J. Bacteriol. (1983) 153:163; Kurtz et al.,
Mol. Cell. Biol. (1986) 6:142; Kunze et al., J. Basic Microbiol. (1985) 25:141; Gleeson et
al., J. Gen. Microbiol. (1986) 132:3459; Roggenkamp et al., Mol. Gen. Genet. (1986)
202:302; Das et al., J. Bacteriol. (1984) 158:1165; De Louvencourt et al., J. Bacteriol.
(1983) 154:737; Van den Berg et al., Bio/Technology (1990) 8:135; Kunze et al., J. Basic
Microbiol. (1985) 25:141; Cregg et al., Mol. Cell. Biol. (1985) 5:3376; U.S. Patent Nos.
4,837,148 and 4,929,555; Beach and Nurse, Nature (1981) 300:706; Davidow et al., Curr.
Genet. (1985) 10:380; Gaillardin et al., Curr. Genet. (1985) 10:49; Ballance et al., Biochem.
Biophys. Res. Commun. (1983) 112:284-289; Tilburn et al., Gene (1983) 26:205-221; Yelton
et al., Proc. Natl. Acad. Sci. (USA) (1984) 81:1470-1474; Kelly and Hynes, EMBO J. (1985)
4:475-479; EP 0 244,234; and WO 91/00357.

Insect Cells. Expression of heterologous genes in insects is accomplished as
described in U.S. Patent No. 4,745,051; Friesen et al., "The Regulation of Baculovirus Gene
Expression", in: The Molecular Biology Of Baculoviruses (1986) (W. Doerfler, ed.); EP 0
127,839; EP 0 155,476; and Vlak et al., J. Gen. Virol. (1988) 69:765-776; Miller et al., Ann.
Rev. Microbiol. (1988) 42:177; Carbonell et al., Gene (1988) 73:409; Maeda et al., Nature
(1985) 315:592-594; Lebacq-Verheyden et al., Mol. Cell. Biol. (1988) 8:3129; Smith et al.,
Proc. Natl. Acad. Sci. (USA) (1985) 82:8844; Miyajima et al., Gene (1987) 58:273; and
Martin et al., DNA (1988) 7:99. Numerous baculoviral strains and variants and
corresponding permissive insect host cells from hosts are described in Luckow et al.,
Bio/Technology (1988) 6:47-55, Miller et al., Genetic Engineering (1986) 8:277-279, and
Maeda et al., Nature (1985) 315:592-594.

Mammalian Cells. Mammalian expression is accomplished as described in Dijkema
et al., EMBO J. (1985) 4:761, Gorman et al., Proc. Natl. Acad. Sci. (USA) (1982) 79:6777,
Boshart et al., Cell (1985) 41:521 and U.S. Patent No. 4,399,216. Other features of
mammalian expression are facilitated as described in Ham and Wallace, Meth. Enz. (1979)
58:44, Barnes and Sato, Anal. Biochem. (1980) 102:255, U.S. Patent Nos. 4,767,704,
4,657,866, 4,927,762, 4,560,655, WO 90/103430, WO 87/00195, and U.S. RE 30,985.

Polynucleotide molecules comprising a polynucleotide sequence provided herein are
propagated by placing the molecule in a vector. Viral and non-viral vectors are used,
including plasmids. The choice of plasmid will depend on the type of cell in which
propagation is desired and the purpose of propagation. Certain vectors are useful for
amplifying and making large amounts of the desired DNA sequence. Other vectors are
suitable for expression in cells in culture. Still other vectors are suitable for transfer and
expression in cells in a whole animal or person. The choice of appropriate vector is well
within the skill of the art. Many such vectors are available commercially. The partial or
full-length polynucleotide is inserted into a vector typically by means of DNA ligase
attachment to a cleaved restriction enzyme site in the vector. Alternatively, the desired
nucleotide sequence can be inserted by homologous recombination in vivo. Typically this is
accomplished by attaching regions of homology to the vector on the flanks of the desired
nucleotide sequence. Regions of homology are added by ligation of oligonucleotides, or by
polymerase chain reaction using primers comprising both the region of homology and a
portion of the desired nucleotide sequence, for example.
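
As an illustration of primers that carry both a region of homology and a portion of the desired nucleotide sequence, the following sketch builds a forward and a reverse primer from two vector homology arms and an insert. The function name, the arm sequences, and the 18-nt annealing length are assumptions chosen for the example, not values from this disclosure.

```python
# Illustrative sketch only: primers with 5' homology tails for
# recombination-based insertion. All names and lengths are hypothetical.

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def homology_primers(left_arm, right_arm, insert, anneal_len=18):
    """Return (forward, reverse) primers, both written 5' to 3'.

    forward = left homology arm + first anneal_len nt of the insert
    reverse = reverse complement of (last anneal_len nt of insert + right arm)
    """
    if anneal_len > len(insert):
        raise ValueError("anneal_len exceeds insert length")
    forward = left_arm + insert[:anneal_len]
    reverse = reverse_complement(insert[-anneal_len:] + right_arm)
    return forward, reverse


if __name__ == "__main__":
    fwd, rev = homology_primers("AAAA", "CCCC", "ATGGCCTAA", anneal_len=3)
    print(fwd, rev)
```

Amplifying the insert with such primers yields a product flanked by the two arms, which is the arrangement the paragraph above describes.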

The polynucleotides set forth in SEQ ID NOS: 1-844 or their corresponding full-length
polynucleotides are linked to regulatory sequences as appropriate to obtain the desired
expression properties. These can include promoters (attached either at the 5' end of the sense
strand or at the 3' end of the antisense strand), enhancers, terminators, operators, repressors,
and inducers. The promoters can be regulated or constitutive. In some situations it may be
desirable to use conditionally active promoters, such as tissue-specific or developmental
stage-specific promoters. These are linked to the desired nucleotide sequence using the
techniques described above for linkage to vectors. Any techniques known in the art can be
used.

When any of the above host cells, or other appropriate host cells or organisms, are
used to replicate and/or express the polynucleotides or nucleic acids of the invention, the
resulting replicated nucleic acid, RNA, expressed protein or polypeptide, is within the scope
of the invention as a product of the host cell or organism. The product is recovered by any
appropriate means known in the art.

Once the gene corresponding to a selected polynucleotide is identified, its expression
can be regulated in the cell to which the gene is native. For example, an endogenous gene of
a cell can be regulated by an exogenous regulatory sequence as disclosed in U.S. Patent No.
5,641,670.

III. Identification of Functional and Structural Motifs of Novel Genes

A. Screening Polynucleotide Sequences and Amino Acid Sequences Against
Publicly Available Databases

Translations of the nucleotide sequence of the provided polynucleotides, cDNAs or
full genes can be aligned with individual known sequences. Similarity with individual
sequences can be used to determine the activity of the polypeptides encoded by the
polynucleotides of the invention. For example, sequences that show similarity with a
chemokine sequence can exhibit chemokine activities. Also, sequences exhibiting similarity
with more than one individual sequence can exhibit activities that are characteristic of either
or both individual sequences.

The full length sequences and fragments of the polynucleotide sequences of the
nearest neighbors can be used as probes and primers to identify and isolate the full length
sequence corresponding to provided polynucleotides. The nearest neighbors can indicate a
tissue or cell type to be used to construct a library for the full-length sequences
corresponding to the provided polynucleotides.

Typically, a selected polynucleotide is translated in all six frames to determine the
best alignment with the individual sequences. The sequences disclosed herein in the
Sequence Listing are in a 5' to 3' orientation and translation in three frames can be sufficient
(with a few specific exceptions as described in the Examples). These amino acid sequences
are referred to, generally, as query sequences, which will be aligned with the individual
sequences. Databases with individual sequences are described in "Computer Methods for
Macromolecular Sequence Analysis" Methods in Enzymology (1996) 266, Doolittle,
Academic Press, Inc., a division of Harcourt Brace & Co., San Diego, California, USA.
Databases include Genbank, EMBL, and DNA Database of Japan (DDBJ).
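
The six-frame translation step can be sketched as follows. This is a minimal illustration that assumes the standard genetic code and marks any codon containing a non-ACGT character as X; the function names are hypothetical and are not part of any program cited herein.

```python
# Illustrative sketch only: translate a DNA sequence in all six reading
# frames using the standard genetic code.
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(seq):
    """Reverse complement; characters other than A, C, G, T pass through."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to translations."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translations("ATGGCCTAA").items():
        print(frame, protein)
```

When a sequence is known to be in a 5' to 3' orientation, only the three forward frames ("+1" to "+3") need be used as query sequences.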

Query and individual sequences can be aligned using the methods and computer
programs described above, which include BLAST, available over the world wide web at
http://www.ncbi.nlm.nih.gov/BLAST/. Another alignment algorithm is Fasta, available in the
Genetics Computing Group (GCG) package, Madison, Wisconsin, USA, a wholly owned
subsidiary of Oxford Molecular Group, Inc. Other techniques for alignment are described in
Doolittle, supra. Preferably, an alignment program that permits gaps in the sequence is
utilized to align the sequences. The Smith-Waterman algorithm is one type of algorithm that
permits gaps in sequence alignments. See Meth. Mol. Biol. (1997) 70:173-187. Also, the GAP
program using the Needleman and Wunsch alignment method can be utilized to align
sequences. An alternative search strategy uses MPSRCH software, which runs on a
MASPAR computer. MPSRCH uses a Smith-Waterman algorithm to score sequences on a
massively parallel computer. This approach improves the ability to identify sequences that are
distantly related matches, and is especially tolerant of small gaps and nucleotide sequence
errors. Amino acid sequences encoded by the provided polynucleotides can be used to
search both protein and DNA databases.
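
To show how a gapped local alignment is scored, the following sketch implements the basic Smith-Waterman recurrence with a linear gap penalty. The scoring values (match +2, mismatch -1, gap -2) are arbitrary assumptions for the example; production programs such as those named above use substitution matrices and affine gap penalties.

```python
# Illustrative sketch only: Smith-Waterman local alignment score with a
# simple match/mismatch scheme and a linear gap penalty (assumed values).

def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below zero, which is what makes the
            # alignment local rather than global.
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best


if __name__ == "__main__":
    # The second sequence has one extra residue; a single gap lets all
    # eight residues of the first sequence align.
    print(smith_waterman_score("ACGTACGT", "ACGTTACGT"))
```

In this example the gapped alignment (eight matches and one gap) scores 14, whereas the best ungapped local alignment scores 10, which is why a gap-tolerant method is preferred.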

Results of individual and query sequence alignments can be divided into three
categories: high similarity, weak similarity, and no similarity. Individual alignment results
ranging from high similarity to weak similarity provide a basis for determining polypeptide
activity and/or structure. Parameters for categorizing individual results include: percentage
of the alignment region length where the strongest alignment is found, percent sequence
identity, and p value.

The percentage of the alignment region length is calculated by counting the number
of residues of the individual sequence found in the region of strongest alignment, e.g., the
contiguous region of the individual sequence that contains the greatest number of residues
that are identical to the residues of the corresponding region of the aligned query sequence.
This number is divided by the total residue length of the query sequence to calculate a
percentage. For example, a query sequence of 20 amino acid residues might be aligned with
a 20 amino acid region of an individual sequence. The individual sequence might be
identical to amino acid residues 5, 9-15, and 17-19 of the query sequence. The region of
strongest alignment is thus the region stretching from residue 9 to residue 19, an 11 amino
acid stretch. The percentage of the alignment region length is: 11 (length of the region of
strongest alignment) divided by 20 (query sequence length), or 55%.

Percent sequence identity is calculated by counting the number of amino acid
matches between the query and individual sequence and dividing the total number of matches by
the number of residues of the individual sequences found in the region of strongest
alignment. Thus, the percent identity in the example above would be 10 matches divided by
11 amino acids, or approximately 90.9%.
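
The two calculations in the worked example can be reproduced with a short sketch. The function below takes the boundaries of the region of strongest alignment as inputs rather than deriving them; the function and parameter names are hypothetical.

```python
# Illustrative sketch only: percentage of alignment region length and
# percent sequence identity, using the worked example from the text.

def alignment_metrics(query_length, region_start, region_end, matched_positions):
    """Return (percent of alignment region length, percent identity).

    Positions are 1-based residue numbers of the query sequence, and
    region_start..region_end (inclusive) is the region of strongest alignment.
    """
    region_length = region_end - region_start + 1
    matches_in_region = sum(
        1 for p in matched_positions if region_start <= p <= region_end
    )
    pct_region_length = 100.0 * region_length / query_length
    pct_identity = 100.0 * matches_in_region / region_length
    return pct_region_length, pct_identity


if __name__ == "__main__":
    # Identical residues at positions 5, 9-15 and 17-19 of a 20-residue query.
    matched = {5} | set(range(9, 16)) | {17, 18, 19}
    pct_length, pct_identity = alignment_metrics(20, 9, 19, matched)
    print(f"{pct_length:.1f}% of query length, {pct_identity:.1f}% identity")
```

For the example values this gives 55.0% for the alignment region length and 90.9% for sequence identity, matching the figures stated above.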

P value is the probability that the alignment was produced by chance. For a single
alignment, the p value can be calculated according to Karlin et al., Proc. Natl. Acad. Sci.
(1990) 87:2264 and Karlin et al., Proc. Natl. Acad. Sci. (1993) 90. The p value of multiple
alignments using the same query sequence can be calculated using a heuristic approach
described in Altschul et al., Nat. Genet. (1994) 6:119. Alignment programs such as the BLAST
program can calculate the p value.

Another factor to consider for determining identity or similarity is the location of the 
similarity or identity. Strong local alignment can indicate similarity even if the length of 
alignment is short. Sequence identity scattered throughout the length of the query sequence 
also can indicate a similarity between the query and profile sequences. The boundaries of 
the region where the sequences align can be determined according to Doolittle, supra; 
BLAST or FASTA programs; or by determining the area where sequence identity is highest.

High Similarity. In general, in alignment results considered to be of high similarity,
the percent of the alignment region length is typically at least about 55% of the total length
of the query sequence; more typically, at least about 58%; even more typically, at least about
60% of the total residue length of the query sequence. Usually, percent length of the alignment
region can be as much as about 62%; more usually, as much as about 64%; even more
usually, as much as about 66%. Further, for high similarity, the region of alignment,
typically, exhibits at least about 75% of sequence identity; more typically, at least about
78%; even more typically, at least about 80% sequence identity. Usually, percent sequence
identity can be as much as about 82%; more usually, as much as about 84%; even more
usually, as much as about 86%.

The p value is used in conjunction with these methods. If high similarity is found,
the query sequence is considered to have high similarity with a profile sequence when the p
value is less than or equal to about 10⁻²; more usually, less than or equal to about 10⁻³; even
more usually, less than or equal to about 10⁻⁴. More typically, the p value is no more than
about 10⁻⁵; more typically, no more than or equal to about 10⁻¹⁰; even more typically, no
more than or equal to about 10⁻¹⁵ for the query sequence to be considered high similarity.

Weak Similarity. In general, where alignment results are considered to be of weak
similarity, there is no minimum percent length of the alignment region nor minimum length
of alignment. A better showing of weak similarity is considered when the region of
alignment is, typically, at least about 15 amino acid residues in length; more typically, at
least about 20; even more typically, at least about 25 amino acid residues in length. Usually,
length of the alignment region can be as much as about 30 amino acid residues; more
usually, as much as about 40; even more usually, as much as about 60 amino acid residues.
Further, for weak similarity, the region of alignment, typically, exhibits at least about 35% of
sequence identity; more typically, at least about 40%; even more typically, at least about
45% sequence identity. Usually, percent sequence identity can be as much as about 50%;
more usually, as much as about 55%; even more usually, as much as about 60%.

If weak similarity is found, the query sequence is considered to have weak similarity
with a profile sequence when the p value is usually less than or equal to about 10⁻²; more
usually, less than or equal to about 10⁻³; even more usually, less than or equal to about 10⁻⁴.
More typically, the p value is no more than about 10⁻⁵; more usually, no more than or equal
to about 10⁻¹⁰; even more usually, no more than or equal to about 10⁻¹⁵ for the query sequence
to be considered weak similarity.
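
The categories above can be expressed as a simple decision rule. The sketch below uses only the least stringent ("at least about") thresholds given in the text: a p value of at most 0.01 for either category, at least 55% alignment region length and 75% identity for high similarity, and at least 15 aligned residues and 35% identity for weak similarity. Combining the criteria with a logical AND, and treating the 15-residue figure as a requirement, are assumptions made for this illustration; the function name is hypothetical.

```python
# Illustrative sketch only: classify an alignment result as high, weak,
# or no similarity using the least stringent thresholds from the text.

def classify_alignment(pct_region_length, pct_identity, region_residues, p_value):
    """Return 'high', 'weak', or 'none'.

    pct_region_length: alignment region length as a percent of query length
    pct_identity:      percent identity within the region of strongest alignment
    region_residues:   length of the alignment region in amino acid residues
    p_value:           probability that the alignment arose by chance
    """
    if p_value > 1e-2:
        return "none"
    if pct_region_length >= 55.0 and pct_identity >= 75.0:
        return "high"
    if region_residues >= 15 and pct_identity >= 35.0:
        return "weak"
    return "none"


if __name__ == "__main__":
    print(classify_alignment(55.0, 90.9, 11, 1e-5))
    print(classify_alignment(10.0, 40.0, 20, 1e-3))
```

Stricter cutoffs, such as the "more typically" values, can be substituted for the constants without changing the structure of the rule.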

Similarity Determined by Sequence Identity Alone. Sequence identity alone can be
used to determine similarity of a query sequence to an individual sequence and can indicate
the activity of the sequence. Such an alignment, preferably, permits gaps to align sequences.
Typically, the query sequence is related to the profile sequence if the sequence identity over
the entire query sequence is at least about 15%; more typically, at least about 20%; even
more typically, at least about 25%; even more typically, at least about 50%. Sequence
identity alone as a measure of similarity is most useful when the query sequence is usually
at least 80 residues in length; more usually, 90 residues; even more usually, at least 95 amino
acid residues in length. More typically, similarity can be concluded based on sequence
identity alone when the query sequence is preferably 100 residues in length; more preferably,
120 residues in length; even more preferably, 150 amino acid residues in length.

Determining Activity from Alignments with Profile and Multiple Aligned Sequences.
Translations of the provided polynucleotides can be aligned with amino acid profiles that
define either protein families or common motifs. Also, translations of the provided
polynucleotides can be aligned to multiple sequence alignments (MSA) comprising the
polypeptide sequences of members of protein families or motifs. Similarity or identity with
profile sequences or MSAs can be used to determine the activity of the gene products (e.g.,
polypeptides) encoded by the provided polynucleotides or corresponding cDNA or genes.
For example, sequences that show an identity or similarity with a chemokine profile or MSA
can exhibit chemokine activities.

Profiles can be designed manually by (1) creating an MSA, which is an alignment of the
amino acid sequence of members that belong to the family and (2) constructing a statistical
representation of the alignment. Such methods are described, for example, in Birney et al.,
Nucl. Acid Res. (1996) 24(14): 2730-2739. MSAs of some protein families and motifs are
publicly available. For example, http://genome.wustl.edu/Pfam/ includes MSAs of 547
different families and motifs. These MSAs are described also in Sonnhammer et al.,
Proteins (1997) 28: 405-420. Other sources over the world wide web include the site at
http://www.embl-heidelberg.de/argos/ali/ali.html; alternatively, a message can be sent to
ALI@EMBL-HEIDELBERG.DE for the information. A brief description of these MSAs is
reported in Pascarella et al., Prot. Eng. (1996) 9(3):249-251. Techniques for building
profiles from MSAs are described in Sonnhammer et al., supra; Birney et al., supra; and
"Computer Methods for Macromolecular Sequence Analysis," Methods in Enzymology
(1996) 266, Doolittle, Academic Press, Inc., a division of Harcourt Brace & Co., San Diego,
California, USA.
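
One simple form of "statistical representation" of an MSA is a table of residue frequencies per alignment column. The sketch below builds such a table and scores an already-aligned query against it. It is a deliberately simplified stand-in and is not the profile model used by the programs or references cited above; the function names and the toy alignment are hypothetical.

```python
# Illustrative sketch only: a per-column residue frequency profile built
# from a multiple sequence alignment (MSA), and a naive query score.
from collections import Counter


def build_profile(msa):
    """Return a list with one dict per column mapping residue -> frequency."""
    if len({len(row) for row in msa}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*msa):
        counts = Counter(column)
        profile.append({res: n / len(msa) for res, n in counts.items()})
    return profile


def score_query(profile, aligned_query):
    """Average, over profile columns, the frequency of the query residue."""
    total = sum(col.get(res, 0.0) for col, res in zip(profile, aligned_query))
    return total / len(profile)


if __name__ == "__main__":
    toy_msa = ["ACD", "ACE", "GCD"]  # made-up three-member family
    profile = build_profile(toy_msa)
    print(profile[1])                   # the fully conserved middle column
    print(score_query(profile, "ACD"))  # higher means closer to the family
```

A query that matches the most frequent residue in every column receives the highest score under this scheme.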

Similarity between a query sequence and a protein family or motif can be determined 
by (a) comparing the query sequence against the profile and/or (b) aligning the query 
sequence with the members of the family or motif. Typically, a program such as Searchwise 
is used to compare the query sequence to the statistical representation of the multiple 
alignment, also known as a profile. The program is described in Birney et al., supra. Other 
techniques to compare the sequence and profile are described in Sonnhammer et al., supra
and Doolittle, supra. 

Next, methods described by Feng et al., J. Mol. Evol. (1987) 25:351 and Higgins et
al., CABIOS (1989) 5:151 can be used to align the query sequence with the members of a
family or motif, also known as a MSA. Computer programs, such as PILEUP, can be used.
See Feng et al., infra. In general, the following factors are used to determine if a similarity
between a query sequence and a profile or MSA exists: (1) number of conserved residues
found in the query sequence, (2) percentage of conserved residues found in the query
sequence, (3) number of frameshifts, and (4) spacing between conserved residues.

Some alignment programs that both translate and align sequences can make any
number of frameshifts when translating the nucleotide sequence to produce the best
alignment. The fewer frameshifts needed to produce an alignment, the stronger the
similarity or identity between the query and profile or MSAs. For example, a weak
similarity resulting from no frameshifts can be a better indication of activity or structure of a
query sequence, than a strong similarity resulting from two frameshifts. Preferably, three or
fewer frameshifts are found in an alignment; more preferably two or fewer frameshifts; even
more preferably, one or fewer frameshifts; even more preferably, no frameshifts are found in
an alignment of query and profile or MSAs.

Conserved residues are those amino acids found at a particular position in all or some
of the family or motif members. For example, most chemokines contain four conserved
cysteines. Alternatively, a position is considered conserved if only a certain class of amino
acids is found in a particular position in all or some of the family members. For example,
the N-terminal position can contain a positively charged amino acid, such as lysine, arginine,
or histidine.

Typically, a residue of a polypeptide is conserved when a class of amino acids or a
single amino acid is found at a particular position in at least about 40% of all class members;
more typically, at least about 50%; even more typically, at least about 60% of the members.
Usually, a residue is conserved when a class or single amino acid is found in at least about
70% of the members of a family or motif; more usually, at least about 80%; even more
usually, at least about 90%; even more usually, at least about 95%.
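
The conservation rule can be illustrated with a sketch that scans each column of an MSA and reports the positions at which a single amino acid, or a class of amino acids, reaches a chosen fraction of the members. The 40% default is the least stringent value given above. The two amino acid classes, the toy alignment, and the function name are assumptions made for the example.

```python
# Illustrative sketch only: find conserved positions in an MSA, where a
# position is conserved if one residue or one residue class reaches the
# threshold fraction of family members.

AMINO_ACID_CLASSES = {
    "positive": set("KRH"),  # lysine, arginine, histidine
    "negative": set("DE"),   # aspartate, glutamate (assumed class)
}


def conserved_positions(msa, threshold=0.40, classes=AMINO_ACID_CLASSES):
    """Return {column index: set of residues defining the conserved group}."""
    n_members = len(msa)
    conserved = {}
    for index, column in enumerate(zip(*msa)):
        # Candidate groups: every single residue seen in the column,
        # followed by each predefined class.
        candidates = [{res} for res in sorted(set(column)) if res != "-"]
        candidates += list(classes.values())
        best_group, best_fraction = None, 0.0
        for group in candidates:
            fraction = sum(res in group for res in column) / n_members
            if fraction > best_fraction:
                best_group, best_fraction = group, fraction
        if best_group is not None and best_fraction >= threshold:
            conserved[index] = set(best_group)
    return conserved


if __name__ == "__main__":
    toy_msa = ["KCAC", "RCGC", "HCTC", "KCSC", "ECLC"]  # made-up family
    print(conserved_positions(toy_msa))
```

In the toy alignment, the first column is conserved as a positively charged class, the second and fourth columns are conserved cysteines, and the third column is not conserved.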

A residue is considered conserved when three unrelated amino acids are found at a
particular position in some or all of the members; more usually, two unrelated amino
acids. These residues are conserved when the unrelated amino acids are found at particular
positions in at least about 40% of all class members; more typically, at least about 50%; even
more typically, at least about 60% of the members. Usually, a residue is conserved when a
class or single amino acid is found in at least about 70% of the members of a family or motif;
more usually, at least about 80%; even more usually, at least about 90%; even more usually,
at least about 95%.

A query sequence has similarity to a profile or MSA when the query sequence
comprises at least about 25% of the conserved residues of the profile or MSA; more usually,
at least about 30%; even more usually, at least about 40%. Typically, the query sequence
has a stronger similarity to a profile sequence or MSA when the query sequence comprises at
least about 45% of the conserved residues of the profile or MSA; more typically, at least
about 50%; even more typically, at least about 55%.
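
The fraction of conserved residues present in a query can be computed as sketched below. The query is assumed to be already aligned to the profile or MSA so that positions correspond, and the conserved positions are supplied as a mapping from 0-based position to the set of acceptable residues. The 25% default is the least stringent value given above; the function names and example data are hypothetical.

```python
# Illustrative sketch only: fraction of a profile's conserved residues
# that are present in an aligned query sequence.

def conserved_fraction(aligned_query, conserved):
    """Return the fraction of conserved positions satisfied by the query.

    conserved maps a 0-based alignment position to the set of residues
    accepted at that position.
    """
    if not conserved:
        return 0.0
    hits = sum(
        1
        for position, allowed in conserved.items()
        if position < len(aligned_query) and aligned_query[position] in allowed
    )
    return hits / len(conserved)


def has_profile_similarity(aligned_query, conserved, minimum=0.25):
    """True when the query contains at least `minimum` of the conserved residues."""
    return conserved_fraction(aligned_query, conserved) >= minimum


if __name__ == "__main__":
    conserved = {0: {"K", "R", "H"}, 1: {"C"}, 3: {"C"}, 5: {"C"}}  # made up
    print(conserved_fraction("RCAGAC", conserved))
    print(has_profile_similarity("AAAAAA", conserved))
```

Raising the minimum to 0.45 or higher corresponds to the "stronger similarity" thresholds described above.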

B. Screening Polynucleotide and Amino Acid Sequences Against Protein
Profiles

The identity and function of the gene that correlates to a polynucleotide described
herein can be determined by screening the polynucleotides or their corresponding amino acid
sequences against profiles of protein families. Such profiles focus on common structural
motifs among proteins of each family. Publicly available profiles are described above in
Section III.A. Additional or alternative profiles are described below.

In comparing a novel polynucleotide with known sequences, several alignment tools
are available. Examples include PileUp, which creates a multiple sequence alignment, and is
described in Feng et al., J. Mol. Evol. (1987) 25:351. Another method, GAP, uses the
alignment method of Needleman et al., J. Mol. Biol. (1970) 48:443. GAP is best suited for
global alignment of sequences. A third method, BestFit, functions by inserting gaps to
maximize the number of matches using the local homology algorithm of Smith et al., Adv.
Appl. Math. (1981) 2:482. Exemplary protein profiles are provided below and in the
examples.

Chemokines. Chemokines are a family of proteins that have been implicated in
lymphocyte trafficking, inflammatory diseases, angiogenesis, hematopoiesis, and viral
infection. See, for example, Rollins, Blood (1997) 90(3):909-928, and Wells et al., J. Leuk.
Biol. (1997) 61:545-550. U.S. Patent No. 5,605,817 discloses DNA encoding a chemokine
expressed in fetal spleen. U.S. Patent No. 5,656,724 discloses chemokine-like proteins and
methods of use. U.S. Patent No. 5,602,008 discloses DNA encoding a chemokine expressed
by liver.

Chemokine mutants are polypeptides having an amino acid sequence that possesses 
at least one amino acid substitution, addition, or deletion as compared to native chemokines. 
Fragments possess the same amino acid sequence of the native chemokines; mutants can 
lack the amino and/or carboxyl terminal sequences. Fusions are mutants, fragments, or 
native chemokines that also include amino and/or carboxyl terminal amino acid extensions. 

The number or type of the amino acid changes is not critical, nor is the length or
number of the amino acid deletions, or amino acid extensions that are incorporated in the
chemokines as compared to the native chemokine amino acid sequences. A polynucleotide
encoding one of these variant polypeptides will retain at least about 80% amino acid identity
with at least one known chemokine. Preferably, these polypeptides will retain at least about
85% amino acid sequence identity, more preferably, at least about 90%; even more
preferably, at least about 95%. In addition, the variants exhibit at least 80%; preferably
about 90%; more preferably about 95% of at least one activity exhibited by a native
chemokine, which includes immunological, biological, receptor binding, and signal
transduction functions.

Assays for chemotaxis relating to neutrophils are described in Walz et al., Biochem.
Biophys. Res. Commun. (1987) 149:755, Yoshimura et al., Proc. Natl. Acad. Sci. (USA)
(1987) 84:9233, and Schroder et al., J. Immunol. (1987) 139:3474; to lymphocytes, Larsen et
al., Science (1989) 243:1464, Carr et al., Proc. Natl. Acad. Sci. (USA) (1994) 91:3652; to
tumor-infiltrating lymphocytes, Liao et al., J. Exp. Med. (1995) 182:1301; to hematopoietic
progenitors, Aiuti et al., J. Exp. Med. (1997) 185:111; to monocytes, Valente et al., Biochem.
(1988) 27:4162; and to natural killer cells, Loetscher et al., J. Immunol. (1996) 156:322, and
Allavena et al., Eur. J. Immunol. (1994) 24:3233.

Assays for determining the biological activity of attracting eosinophils are described
in Dahinden et al., J. Exp. Med. (1994) 179:751, Weber et al., J. Immunol. (1995) 154:4166,
and Noso et al., Biochem. Biophys. Res. Commun. (1994) 200:1470; for attracting dendritic
cells, Sozzani et al., J. Immunol. (1995) 155:3292; for attracting basophils, in Dahinden et
al., J. Exp. Med. (1994) 179:751, Alam et al., J. Immunol. (1994) 152:1298, Alam et al., J.
Exp. Med. (1992) 176:781; and for activating neutrophils, Maghazaci et al., Eur. J. Immunol.
(1996) 26:315, and Taub et al., J. Immunol. (1995) 155:3877. Native chemokines can act as
mitogens for fibroblasts, assayed as described in Mullenbach et al., J. Biol. Chem. (1986)
261:719.

Native chemokines exhibit binding activity with a number of receptors. Description
of such receptors and assays to detect binding are described in, for example, Murphy et al.,
Science (1991) 253:1280; Combadiere et al., J. Biol. Chem. (1995) 270:29671; Daugherty et
al., J. Exp. Med. (1996) 183:2349; Samson et al., Biochem. (1996) 35:3362; Raport et al., J.
Biol. Chem. (1996) 271:17161; Combadiere et al., J. Leukoc. Biol. (1996) 60:147; Baba et
al., J. Biol. Chem. (1997) 25:14893; Yosida et al., J. Biol. Chem. (1997) 272:13803;
Arvannitakis et al., Nature (1997) 385:347, and other assays are known in the art.

Assays for kinase activation of chemokines are described by Yen et al., J. Leukoc.
Biol. (1997) 61:529; Dubois et al., J. Immunol. (1996) 156:1356; Turner et al., J. Immunol.
(1995) 155:2437. Assays for inhibition of angiogenesis or cell proliferation are described in
Maione et al., Science (1990) 247:77. Glycosaminoglycan production can be induced by
native chemokines, assayed as described in Castor et al., Proc. Natl. Acad. Sci. (USA) (1983)
80:765. Chemokine-mediated histamine release from basophils is assayed as described in
Dahinden et al., J. Exp. Med. (1989) 170:1787; and White et al., Immunol. Lett. (1989)
22:151. Heparin binding is described in Luster et al., J. Exp. Med. (1995) 182:219.

Chemokines can possess dimerization activity, which can be assayed according to
Burrows et al., Biochem. (1994) 33:12741; and Zhang et al., Mol. Cell. Biol. (1995) 15:4851.
Native chemokines can play a role in the inflammatory response of viruses. This activity
can be assayed as described in Bleul et al., Nature (1996) 382:829; and Oberlin et al., Nature
(1996) 382:833. Exocytosis of monocytes can be promoted by native chemokines. The
assay for such activity is described in Uguccioni et al., Eur. J. Immunol. (1995) 25:64.
Native chemokines also can inhibit hematopoietic stem cell proliferation. The method for
testing for such activity is reported in Graham et al., Nature (1990) 344:442.

Death Domain Proteins. Several protein families contain death domain motifs
(Feinstein and Kimchi, TIBS Letters (1995) 20:342). Some death domain containing
proteins are implicated in cytotoxic intracellular signaling (Cleveland et al., Cell (1995)
81:479; Pan et al., Science (1997) 276:111; Duan et al., Nature (1997) 385:86-89; and
Chinnaiyan et al., Science (1996) 274:990). U.S. Patent No. 5,563,039 describes a protein
homologous to TRADD (Tumor Necrosis Factor Receptor-1 Associated Death Domain
containing protein), and modifications of the active domain of TRADD that retain the
functional characteristics of the protein, as well as apoptosis assays for testing the function of
such death domain containing proteins. U.S. Patent No. 5,658,883 discloses biologically
active TGF-β1 peptides. U.S. Patent No. 5,674,734 discloses RIP, which contains a
C-terminal death domain and an N-terminal kinase domain.

Leukemia Inhibitory Factor (LIF). An LIF profile is constructed from sequences of
leukemia inhibitory factor, CT-1 (cardiotrophin-1), CNTF (ciliary neurotrophic factor), OSM
(oncostatin M), and IL-6 (interleukin-6). This profile encompasses a family of secreted
cytokines that have pleiotropic effects on many cell types including hepatocytes, osteoclasts,
neuronal cells and cardiac myocytes, and can be used to detect additional genes encoding
such proteins. These molecules are all structurally related and share a common co-receptor,
gp130, which mediates intracellular signal transduction by cytoplasmic tyrosine kinases such
as src.

Novel proteins related to this family are also likely to be secreted, to activate gp130
and to function in the development of a variety of cell types. Thus new members of this
family would be candidates to be developed as growth or survival factors for the cell types
that they stimulate. For more details on this family of cytokines, see Pennica et al., Cytokine
and Growth Factor Reviews (1996) 7:81-91. U.S. Patent No. 5,420,247 discloses LIF
receptor and fusion proteins. U.S. Patent No. 5,443,825 discloses human LIF.

Angiopoietin. Angiopoietin-1 is a secreted ligand of the TIE-2 tyrosine kinase; it
functions as an angiogenic factor critical for normal vascular development. Angiopoietin-2 is
a natural antagonist of angiopoietin-1 and thus functions as an anti-angiogenic factor. These
two proteins are structurally similar and activate the same receptor (Folkman et al., Cell
(1996) 87:1153, and Davis et al., Cell (1996) 87:1161). The angiopoietin molecules are
composed of two domains: a coiled-coil region and a region related to fibrinogen. The
fibrinogen domain is found in many molecules including ficolin and tenascin, and is well
defined structurally with many members.

Receptor Protein-Tyrosine Kinases. Receptor protein-tyrosine kinases (RPTKs)
are described in Lindberg, Annu. Rev. Cell Biol. (1994) 10:251-337.

Growth Factors: Epidermal Growth Factor (EGF) and Fibroblast Growth Factor
(FGF). For a discussion of growth factor superfamilies, see Growth Factors: A Practical
Approach, (Appendix A1) (1993) McKay and Leigh, Oxford University Press, NY, 237-243.
U.S. Patent No. 4,444,760 discloses acidic brain fibroblast growth factor, which is active in
the promotion of cell division and wound healing. U.S. Patent No. 5,439,818 discloses DNA
encoding human recombinant basic fibroblast growth factor, which is active in wound
healing. U.S. Patent No. 5,604,293 discloses recombinant human basic fibroblast growth
factor, which is useful for wound healing. U.S. Patent No. 5,410,832 discloses brain-derived
and recombinant acidic fibroblast growth factor, which act as mitogens for mesoderm and
neuroectoderm-derived cells in culture, and promote wound healing in soft tissue,
cartilaginous tissue and musculoskeletal tissue. U.S. Patent No. 5,387,673 discloses
biologically active fragments of FGF.

Proteins of the TNF Family. A profile derived from the TNF family is created by
aligning sequences of the following TNF family members: nerve growth factor (NGF),
lymphotoxin, Fas ligand, tumor necrosis factor (TNFα), CD40 ligand, TRAIL, ox40 ligand,
4-1BB ligand, CD27 ligand, and CD30 ligand. The profile is designed to identify sequences
of proteins that constitute new members or homologues of this family of proteins. U.S.
Patent No. 5,606,023 discloses mutant TNF proteins; U.S. Patent No. 5,597,899 and U.S.
Patent No. 5,486,463 disclose TNF muteins; and U.S. Patent No. 5,652,353 discloses DNA
encoding TNFα muteins.

Members of the TNF family of proteins have been shown in vitro to multimerize, as
described in Burrows et al., Biochem. (1994) 33:12741 and Zhang et al., Mol. Cell. Biol.
(1995) 15:4851, and to bind receptors as described in Browning et al., J. Immunol. (1994)
147:1230, Androlewicz et al., J. Biol. Chem. (1992) 267:2542, and Crowe et al., Science
(1994) 264:707.

In vivo, TNFs proteolytically cleave a target protein as described in Kriegler et al.,
Cell (1988) 53:45 and Mohler et al., Nature (1994) 370:218 and demonstrate cell
proliferation and differentiation activity. T-cell or thymocyte proliferation is assayed as
described in Armitage et al., Eur. J. Immunol. (1992) 22:447; Current Protocols in
Immunology, ed. J.E. Coligan et al., 3.1-3.19; Takai et al., J. Immunol. (1986) 137:3494-3500;
Bertagnoli et al., J. Immunol. (1990) 145:1706; Bertagnoli et al., J. Immunol. (1991)
133:327; Bertagnoli et al., J. Immunol. (1992) 149:3778; and Bowman et al., J. Immunol.
(1994) 152:1756. B cell proliferation and Ig secretion are assayed as described in
Maliszewski, J. Immunol. (1990) 144:3028; Assays for B Cell Function: In Vitro
Antibody Production, Mond and Brunswick, Current Protocols in Immunol., Coligan ed., vol.
1, pp. 3.8.1-3.8.16, John Wiley and Sons, Toronto, 1994; Kehrl et al., Science (1987) 238:1144;
and Boussiotis et al., PNAS USA (1994) 91:7007. Other in vivo activities include
upregulation of cell surface antigens, upregulation of costimulatory molecules, and cellular
aggregation/adhesion as described in Barrett et al., J. Immunol. (1991) 146:1722; Bjorck et
al., Eur. J. Immunol. (1993) 23:1771; Clark et al., Annu. Rev. Immunol. (1991) 9:97;
Ranheim et al., J. Exp. Med. (1994) 177:925; Yellin, J. Immunol. (1994) 153:666; and Gruss
et al., Blood (1994) 84:2305.

Proliferation and differentiation of hematopoietic and lymphopoietic cells have also
been shown in vivo for TNFs, using assays for embryonic differentiation and hematopoiesis
as described in Johansson et al., Cellular Biology (1995) 15:141; Keller et al., Mol. Cell.
Biol. (1993) 13:473; and McClanahan et al., Blood (1993) 81:2903, and using assays to detect
stem cell survival and differentiation as described in Culture of Hematopoietic Cells,
Freshney et al. eds., pp. 1-21, 23-29, 139-162, 163-179, and 265-268, Wiley-Liss, Inc., New
York, NY, 1994, and Hirayama et al., PNAS USA (1992) 89:5907.

In vivo activities of TNFs also include lymphocyte survival and apoptosis, assayed as
described in Darzynkiewicz et al., Cytometry (1992) 13:795; Gorczyca et al., Leukemia (1993)
7:659; Itoh et al., Cell (1991) 66:233; Zacharchuk, J. Immunol. (1990) 145:4037; Zamai et
al., Cytometry (1993) 14:891; and Gorczyca et al., Int'l J. Oncol. (1992) 1:639. Some
members of the TNF family are cleaved from the cell surface; others remain membrane
bound. The three-dimensional structure of TNF is discussed in Sprang and Eck, Tumor
Necrosis Factors; supra.

TNF proteins include a transmembrane domain. The protein is cleaved into a shorter
soluble version, as described in Kriegler et al., Cell (1988) 53:45, Perez et al., Cell (1990)
63:251, and Shaw et al., Cell (1986) 46:659. The transmembrane domain is between amino
acid 46 and 77 and the cytoplasmic domain is between position 1 and 45 on the human form
of TNFα. The 3-dimensional motifs of TNF include a sandwich of two pleated β sheets.
Each sheet is composed of anti-parallel β strands. β strands facing each other on opposite
sides of the sandwich are connected by short polypeptide loops, as described in Van Ostade et
al., Protein Engineering (1994) 7(1):5, and Sprang et al., Tumor Necrosis Factors; supra.
Residues of the TNF family proteins that are involved in the β sheet secondary structure
have been identified as described in Van Ostade et al., Protein Eng. (1994) 7(1):5, and
Sprang et al., supra.

TNF receptors are disclosed in U.S. Patent No. 5,395,760. A profile derived from the
TNF receptor family is created by aligning sequences of the TNF receptor family, including
Apo1/Fas, TNFR I and II, death receptor 3 (DR3), CD40, ox40, CD27, and CD30. Thus, the
profile is designed to identify from the polynucleotides of the invention sequences of
proteins that constitute new members or homologues of this family of proteins.

Tumor necrosis factor receptors exist in two forms in humans: p55 TNFR and p75
TNFR, both of which provide intracellular signals upon binding with a ligand. The
extracellular domains of these receptor proteins are cysteine rich. The receptors can remain
membrane bound, although some forms of the receptors are cleaved, forming soluble
receptors. The regulation and the diagnostic, prognostic, and therapeutic value of soluble TNF
receptors are discussed in Aderka, Cytokine and Growth Factor Reviews (1996) 7(3):231.

PDGF Family. U.S. Patent No. 5,326,695 discloses platelet derived growth factor 
agonists; bioactive portions of PDGF-B are used as agonists. U.S. Patent No. 4,845,075 
discloses biologically active B-chain homodimers, and also includes variants and derivatives 
of the PDGF-B chain. U.S. Patent No. 5,128,321 discloses PDGF analogs and methods of 
use. Proteins having the same bioactivity as PDGF are disclosed, including A and B chain 
proteins.

Kinase (Including MKK) Family. U.S. Patent No. 5,650,501 discloses a
serine/threonine kinase associated with mitotic and meiotic cell division; the protein has a
kinase domain in its N-terminus and 3 PEST regions in the C-terminus. U.S. Patent No.
5,605,825 discloses human PAK65, a serine protein kinase.

The foregoing discussion provides a few examples of the protein profiles that can be
compared with the polynucleotides of the invention. One skilled in the art can use these and
other protein profiles to identify the genes that correlate with the provided polynucleotides.

C. Identification of Secreted and Membrane-Bound Polypeptides

Both secreted and membrane-bound polypeptides of the present invention are of
particular interest. For example, levels of secreted polypeptides can be assayed in body
fluids that are convenient, such as blood, urine, prostatic fluid and semen. Membrane-bound
polypeptides are useful for constructing vaccine antigens or inducing an immune response.
Such antigens would comprise all or part of the extracellular region of the membrane-bound
polypeptides. Because both secreted and membrane-bound polypeptides comprise a
fragment of contiguous hydrophobic amino acids, hydrophobicity predicting algorithms can
be used to identify such polypeptides.

A signal sequence is usually encoded by both secreted and membrane-bound
polypeptide genes to direct a polypeptide to the surface of the cell. The signal sequence
usually comprises a stretch of hydrophobic residues. Such signal sequences can fold into
helical structures. Membrane-bound polypeptides typically comprise at least one
transmembrane region that possesses a stretch of hydrophobic amino acids that can
traverse the membrane. Some transmembrane regions also exhibit a helical structure.

Hydrophobic fragments within a polypeptide can be identified by using computer
algorithms. Such algorithms include Hopp & Woods, Proc. Natl. Acad. Sci. USA (1981)
78:3824-3828; Kyte & Doolittle, J. Mol. Biol. (1982) 157:105-132; and the RAOAR algorithm,
Degli Esposti et al., Eur. J. Biochem. (1990) 190:207-219.
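
As an illustration only, a sliding-window hydropathy scan in the style of Kyte & Doolittle might be sketched as follows. The function names, the 19-residue window, and the 1.6 cutoff are assumptions chosen for the example (the window and cutoff are a commonly cited heuristic for membrane-spanning segments, not values stated in this specification); the per-residue values are those of the published Kyte-Doolittle scale.

```python
# Kyte-Doolittle hydropathy values, J. Mol. Biol. (1982) 157:105-132.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=19):
    """Mean hydropathy of every full window along the sequence.

    Only the 20 standard one-letter residue codes are handled; any other
    character raises KeyError. A sequence shorter than the window yields [].
    """
    scores = [KD[aa] for aa in protein.upper()]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


def hydrophobic_windows(protein, window=19, threshold=1.6):
    """0-based start indices of windows whose mean exceeds the threshold.

    The window and threshold defaults are illustrative assumptions.
    """
    profile = hydropathy_profile(protein, window)
    return [i for i, mean in enumerate(profile) if mean > threshold]
```

Windows returned by `hydrophobic_windows` would only be candidates for signal or transmembrane segments; the other algorithms cited above use different scales and are not reproduced here.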

Another method of identifying secreted and membrane-bound polypeptides is to
translate the polynucleotides of the invention in all six frames and determine if at least 8
contiguous hydrophobic amino acids are present. Those translated polypeptides with at least
8; more typically, 10; even more typically, 12 contiguous hydrophobic amino acids are
considered to be either a putative secreted or membrane bound polypeptide. Hydrophobic
amino acids include alanine, glycine, histidine, isoleucine, leucine, lysine, methionine,
phenylalanine, proline, threonine, tryptophan, tyrosine, and valine.
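
The six-frame screen described above can be sketched in a few lines. This is a minimal illustration, assuming the standard genetic code and treating a stop codon as ending a hydrophobic run; the function names are hypothetical, and the hydrophobic set is exactly the list of residues given in the preceding paragraph.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# One-letter codes for the hydrophobic residues listed in the text:
# Ala, Gly, His, Ile, Leu, Lys, Met, Phe, Pro, Thr, Trp, Tyr, Val.
HYDROPHOBIC = set("AGHILKMFPTWYV")


def reverse_complement(dna):
    """Reverse complement of an upper-case A/C/G/T string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; unrecognized codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Three forward frames followed by three reverse-complement frames."""
    dna = dna.upper()
    strands = (dna, reverse_complement(dna))
    return [translate(s[offset:]) for s in strands for offset in range(3)]


def longest_hydrophobic_run(protein):
    """Length of the longest stretch of contiguous hydrophobic residues."""
    best = run = 0
    for aa in protein:
        run = run + 1 if aa in HYDROPHOBIC else 0
        best = max(best, run)
    return best


def is_putative_secreted_or_membrane(dna, min_run=8):
    """True if any of the six frames has a run of at least min_run residues."""
    return any(
        longest_hydrophobic_run(p) >= min_run for p in six_frame_translations(dna)
    )
```

Raising `min_run` to 10 or 12 corresponds to the more typical thresholds mentioned in the text.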

IV. Identification of the Function of an Expression Product of a Full-Length Gene
Corresponding to a Polynucleotide

Ribozymes, antisense constructs, and dominant negative mutants can be used to
determine function of the expression product of a gene corresponding to a polynucleotide
provided herein. These methods and compositions are particularly useful where the provided
novel polynucleotide exhibits no significant or substantial homology to a sequence encoding
a gene of known function. Antisense molecules and ribozymes can be constructed from
synthetic polynucleotides. Typically, the phosphoramidite method of oligonucleotide
synthesis is used. See Beaucage et al., Tet. Lett. (1981) 22:1859 and U.S. Patent No.
4,668,777. Automated devices for synthesis are available to create oligonucleotides using
this chemistry. Examples of such devices include Biosearch 8600; Models 392 and 394 by
Applied Biosystems, a division of Perkin-Elmer Corp., Foster City, California, USA; and
Expedite by Perceptive Biosystems, Framingham, Massachusetts, USA. Synthetic RNA,
phosphate analog oligonucleotides, and chemically derivatized oligonucleotides can also be
produced, and can be covalently attached to other molecules. RNA oligonucleotides can be
synthesized, for example, using RNA phosphoramidites. This method can be performed on
an automated synthesizer, such as Applied Biosystems, Models 392 and 394, Foster City,
California, USA. See Applied Biosystems User Bulletin 53 and Ogilvie et al., Pure &
Applied Chem. (1987) 59:325.

Phosphorothioate oligonucleotides can also be synthesized for antisense construction.
A sulfurizing reagent, such as tetraethylthiuram disulfide (TETD) in acetonitrile, can be used
to convert the internucleotide cyanoethyl phosphite to the phosphorothioate triester within 15
minutes at room temperature. TETD replaces the iodine reagent, while all other reagents
used for standard phosphoramidite chemistry remain the same. Such a synthesis method can
be automated using Models 392 and 394 by Applied Biosystems, for example.

Oligonucleotides of up to 200 nucleotides can be synthesized, more typically, 100 
nucleotides, more typically 50 nucleotides; even more typically 30 to 40 nucleotides. These 
synthetic fragments can be annealed and ligated together to construct larger fragments. See, 
for example, Sambrook et al., supra. 

A. Ribozymes

Trans-cleaving catalytic RNAs (ribozymes) are RNA molecules possessing
endoribonuclease activity. Ribozymes are specifically designed for a particular target, and
the target message must contain a specific nucleotide sequence. They are engineered to
cleave any RNA species site-specifically in the background of cellular RNA. The cleavage
event renders the mRNA unstable and prevents protein expression. Importantly, ribozymes
can be used to inhibit expression of a gene of unknown function for the purpose of
determining its function in an in vitro or in vivo context, by detecting the phenotypic effect.

One commonly used ribozyme motif is the hammerhead, for which the substrate
sequence requirements are minimal. Design of the hammerhead ribozyme is disclosed in
Usman et al., Current Opin. Struct. Biol. (1996) 6:527. Usman also discusses the
therapeutic uses of ribozymes. Ribozymes can also be prepared and used as described in
Long et al., FASEB J. (1993) 7:25; Symons, Ann. Rev. Biochem. (1992) 61:641; Perrotta et
al., Biochem. (1992) 31:16; Ojwang et al., Proc. Natl. Acad. Sci. (USA) (1992) 89:10802;
and U.S. Patent No. 5,254,678. Ribozyme cleavage of HIV-1 RNA is described in U.S.
Patent No. 5,144,019; methods of cleaving RNA using ribozymes are described in U.S.
Patent No. 5,116,742; and methods for increasing the specificity of ribozymes are described
in U.S. Patent No. 5,225,337 and Koizumi et al., Nucleic Acids Res. (1989) 17:7059.
Preparation and use of ribozyme fragments in a hammerhead structure are also described by
Koizumi et al., Nucleic Acids Res. (1989) 17:7059. Preparation and use of ribozyme
fragments in a hairpin structure are described by Chowrira and Burke, Nucleic Acids Res.
(1992) 20:2835. Ribozymes can also be made by rolling transcription as described in
Daubendiek and Kool, Nat. Biotechnol. (1997) 15(3):273.

The hybridizing region of the ribozyme can be modified or can be prepared as a
branched structure as described in Horn and Urdea, Nucleic Acids Res. (1989) 17:6959. The
basic structure of the ribozymes can also be chemically altered in ways familiar to those
skilled in the art, and chemically synthesized ribozymes can be administered as synthetic
oligonucleotide derivatives modified by monomeric units. In a therapeutic context, liposome
mediated delivery of ribozymes improves cellular uptake, as described in Birikh et al., Eur.
J. Biochem. (1997) 245:1.

Using the polynucleotide sequences of the invention and methods known in the art,
ribozymes are designed to specifically bind and cut the corresponding mRNA species.
Ribozymes thus provide a means to inhibit the expression of any of the proteins encoded by
the disclosed polynucleotides or their full-length genes. The full-length gene need not be
known in order to design and use specific inhibitory ribozymes. In the case of a
polynucleotide or full-length cDNA of unknown function, ribozymes corresponding to that
nucleotide sequence can be tested in vitro for efficacy in cleaving the target transcript.
Those ribozymes that effect cleavage in vitro are further tested in vivo. The ribozyme can
also be used to generate an animal model for a disease, as described in Birikh et al., supra.
An effective ribozyme is used to determine the function of the gene of interest by blocking
its transcription and detecting a change in the cell. Where the gene is found to be a mediator
in a disease, an effective ribozyme is designed and delivered in a gene therapy for blocking
transcription and expression of the gene.

Therapeutic and functional genomic applications of ribozymes begin
with knowledge of a portion of the coding sequence of the gene to be inhibited. Thus, for
many genes, a partial polynucleotide sequence provides adequate sequence for constructing
an effective ribozyme. A target cleavage site is selected in the target sequence, and a
ribozyme is constructed based on the 5' and 3' nucleotide sequences that flank the cleavage
site. Retroviral vectors are engineered to express monomeric and multimeric hammerhead
ribozymes targeting the mRNA of the target coding sequence. These monomeric and
multimeric ribozymes are tested in vitro for an ability to cleave the target mRNA. A cell line
is stably transduced with the retroviral vectors expressing the ribozymes, and the
transduction is confirmed by Northern blot analysis and reverse-transcription polymerase
chain reaction (RT-PCR). The cells are screened for inactivation of the target mRNA by
such indicators as reduction of expression of disease markers or reduction of the gene
product of the target mRNA.
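
Selecting a candidate cleavage site and its flanking sequences could be sketched as below. This assumes the hammerhead motif and the widely used NUH rule (cleavage 3' of a triplet in which N is any nucleotide and H is A, C or U); neither the rule nor the arm length is stated in this specification, and the function name and returned fields are hypothetical. The sketch reports only the flanks on which binding arms would be based and does not build the catalytic core.

```python
def find_nuh_sites(mrna, arm_len=8):
    """List candidate hammerhead cleavage sites with their flanking sequences.

    Accepts RNA or DNA letters (T is read as U). A site is reported only
    when a full arm_len of sequence is available on both sides of the cut.
    """
    rna = mrna.upper().replace("T", "U")
    sites = []
    for i in range(len(rna) - 2):
        # Triplet N-U-H: any nucleotide, then U, then A, C or U (assumed rule).
        if rna[i + 1] == "U" and rna[i + 2] in "ACU":
            cut = i + 3  # cleavage is taken to occur 3' of the H residue
            if arm_len > cut or cut + arm_len > len(rna):
                continue
            sites.append({
                "triplet": rna[i:i + 3],
                "cut_index": cut,
                "five_prime_flank": rna[cut - arm_len:cut],
                "three_prime_flank": rna[cut:cut + arm_len],
            })
    return sites
```

In practice, candidate sites would still need to be tested in vitro for cleavage of the target transcript, as the text describes.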

B. Antisense 

Antisense nucleic acids are designed to specifically bind to RNA, resulting in the 
formation of RNA-DNA or RNA-RNA hybrids, with an arrest of DNA replication, reverse 
transcription or messenger RNA translation. Antisense polynucleotides based on a selected 
polynucleotide sequence can interfere with expression of the corresponding gene. Antisense 
polynucleotides are typically generated within the cell by expression from antisense 
constructs that contain the antisense strand as the transcribed strand. Antisense 
polynucleotides based on the disclosed polynucleotides will bind and/or interfere with the 
translation of mRNA comprising a sequence complementary to the antisense polynucleotide. 
The expression products of control cells and cells treated with the antisense construct are 
compared to detect the protein product of the gene corresponding to the polynucleotide upon 
which the antisense construct is based. The protein is isolated and identified using routine 
biochemical methods.
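
For illustration, the sequence of an antisense oligonucleotide is simply the reverse complement of the targeted sense segment. The helper below is a minimal sketch with a hypothetical name; it ignores practical design considerations such as oligonucleotide length, secondary structure and backbone chemistry.

```python
def antisense_dna(target):
    """DNA oligonucleotide (5'->3') complementary to a sense mRNA/cDNA segment."""
    sense = target.upper().replace("U", "T")  # accept RNA or DNA letters
    return sense.translate(str.maketrans("ACGT", "TGCA"))[::-1]
```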

One rationale for using antisense methods to determine the function of the gene
corresponding to a disclosed polynucleotide is the biological activity of antisense
therapeutics. Antisense therapy for a variety of cancers is in clinical phase and has been
discussed extensively in the literature. Reed reviewed antisense therapy directed at the Bcl-2
gene in tumors; gene transfer-mediated overexpression of Bcl-2 in tumor cell lines conferred
resistance to many types of cancer drugs. (Reed, J.C., N.C.I. (1997) 89:988). The potential
for clinical development of antisense inhibitors of ras is discussed by Cowsert, L.M.,
Anti-Cancer Drug Design (1997) 12:359. Additional important antisense targets include
leukemia (Geurtz, A.M., Anti-Cancer Drug Design (1997) 12:341); human C-raf kinase
(Monia, B.P., Anti-Cancer Drug Design (1997) 12:327); and protein kinase C (McGraw et
al., Anti-Cancer Drug Design (1997) 12:315).

Given the extensive background literature and clinical experience in antisense 
therapy, one skilled in the art can use selected polynucleotides of the invention as additional 
potential therapeutics. The choice of polynucleotide can be narrowed by first testing them 
for binding to "hot spot" regions of the genome of cancerous cells. If a polynucleotide is
identified as binding to a "hot spot", testing the polynucleotide as an antisense compound in 
the corresponding cancer cells clearly is warranted. 

Ogunbiyi et al., Gastroenterology (1997) 113(3):761 describe prognostic use of
allelic loss in colon cancer; Barks et al., Genes, Chromosomes, and Cancer (1997) 19(4):278
describe increased chromosome copy number detected by FISH in malignant melanoma;
Nishizake et al., Genes, Chromosomes, and Cancer (1997) 19(4):267 describe genetic
alterations in primary breast cancer and their metastases and direct comparison using
modified comparative genome hybridization; and Elo et al., Cancer Research (1997)
57(16):3356 disclose that loss of heterozygosity at 16q24.1-q24.2 is significantly associated
with metastatic and aggressive behavior of prostate cancer.

C. Dominant Negative Mutations 

As an alternative method for identifying function of the gene corresponding to a
polynucleotide disclosed herein, dominant negative mutations are readily generated for
corresponding proteins that are active as homomultimers. A mutant polypeptide will interact
with wild-type polypeptides (made from the other allele) and form a non-functional
multimer. The mutation can be in a substrate-binding domain, a catalytic domain, or a
cellular localization domain. Preferably, the mutant polypeptide will be overproduced.
Point mutations are made that have such an effect. In addition, fusion of different
polypeptides of various lengths to the terminus of a protein can yield dominant negative
mutants. General strategies are available for making dominant negative mutants (see, e.g.,
Herskowitz, Nature (1987) 329:219). Such techniques can be used to create loss of function
mutations, which are useful for determining protein function.

V. Construction of Polypeptides of the Invention and Variants Thereof

The polypeptides of the invention include those encoded by the disclosed
polynucleotides. These polypeptides can also be encoded by nucleic acids that, by virtue of
the degeneracy of the genetic code, are not identical in sequence to the disclosed
polynucleotides. Thus, the invention includes within its scope a polypeptide encoded by a
polynucleotide having the sequence of any one of SEQ ID NOS: 1-844 or a variant thereof.

In general, the term "polypeptide" as used herein refers to the full length
polypeptide encoded by the recited polynucleotide, the polypeptide encoded by the gene
represented by the recited polynucleotide, as well as portions or fragments thereof.
"Polypeptides" also includes variants of the naturally occurring proteins, where such
variants are homologous or substantially similar to the naturally occurring protein, and can
be of an origin of the same or different species as the naturally occurring protein (e.g.,
human, murine, or some other species that naturally expresses the recited polypeptide,
usually a mammalian species). In general, variant polypeptides have a sequence that has at
least about 80%, usually at least about 90%, and more usually at least about 98% sequence
identity with a differentially expressed polypeptide of the invention, as measured by BLAST
using the parameters described above. The variant polypeptides can be naturally or
non-naturally glycosylated, i.e., the polypeptide has a glycosylation pattern that differs from the
glycosylation pattern found in the corresponding naturally occurring protein.

The invention also encompasses homologs of the disclosed polypeptides (or
fragments thereof) where the homologs are isolated from other species, i.e., other animal or
plant species, where such homologs are usually from mammalian species, e.g., rodents, such as mice,
rats; domestic animals, e.g., horse, cow, dog, cat; and humans. By homolog is meant a
polypeptide having at least about 35%, usually at least about 40% and more usually at least
about 60% amino acid sequence identity to a particular differentially expressed protein as
identified above, where sequence identity is determined using the BLAST algorithm, with
the parameters described supra.
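
The percent-identity thresholds for variants (at least about 80%) and homologs (at least about 35%) can be illustrated with a simplified calculation over two sequences that have already been aligned. This is a sketch only: it does not reproduce BLAST, which performs its own local alignment and scoring, and the function names and the choice to ignore gapped columns are assumptions made for the example.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identical residues over the ungapped columns of an alignment.

    Both inputs must be the same length, with '-' marking gaps.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [
        (a, b)
        for a, b in zip(aligned_a.upper(), aligned_b.upper())
        if a != "-" and b != "-"
    ]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


def classify_by_identity(identity):
    """Apply the lowest 'at least about' thresholds quoted in the text."""
    if identity >= 80.0:
        return "variant"
    if identity >= 35.0:
        return "homolog"
    return "neither"
```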

In general, the polypeptides of the subject invention are provided in a non-naturally
occurring environment, e.g., are separated from their naturally occurring environment. In
certain embodiments, the subject protein is present in a composition that is enriched for the
protein as compared to a control. As such, purified polypeptide is provided, where by
purified is meant that the protein is present in a composition that is substantially free of
non-differentially expressed polypeptides, where by substantially free is meant that less than
90%, usually less than 60% and more usually less than 50% of the composition is made up
of non-differentially expressed polypeptides.

Also within the scope of the invention are variants; variants of polypeptides include
mutants, fragments, and fusions. Mutants can include amino acid substitutions, additions or
deletions. The amino acid substitutions can be conservative amino acid substitutions or
substitutions to eliminate non-essential amino acids, such as to alter a glycosylation site, a
phosphorylation site or an acetylation site, or to minimize misfolding by substitution or
deletion of one or more cysteine residues that are not necessary for function. Conservative
amino acid substitutions are those that preserve the general charge,
hydrophobicity/hydrophilicity, and/or steric bulk of the amino acid substituted. For
example, substitutions between the following groups are conservative: Gly/Ala, Val/Ile/Leu,
Asp/Glu, Lys/Arg, Asn/Gln, Ser/Cys, Thr, and Phe/Trp/Tyr.

Variants can be designed so as to retain biological activity of a particular region of
the protein (e.g., a functional domain and/or, where the polypeptide is a member of a protein
family, a region associated with a consensus sequence). In a non-limiting example, Osawa et
al., Biochem. Mol. Int. (1994) 34:1003, discusses the actin binding region of a protein from
several different species. The actin binding regions of these species are considered
homologous based on the fact that they have amino acids that fall within "homologous
residue groups." Homologous residues are judged according to the following groups (using
single letter amino acid designations): STAG; ILVMF; HRK; DEQN; and FYW. For
example, an S, a T, an A or a G can be in a position and the function (in this case actin
binding) is retained.
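
The homologous residue groups listed above lend themselves to a simple lookup. The sketch below uses hypothetical function names and treats identical residues as homologous even when the residue belongs to none of the listed groups; that last choice is an assumption, not something stated in the text.

```python
# Homologous residue groups as listed in the text (single-letter codes).
HOMOLOGOUS_GROUPS = ("STAG", "ILVMF", "HRK", "DEQN", "FYW")


def residues_homologous(a, b):
    """True if the residues are identical or share at least one group."""
    a, b = a.upper(), b.upper()
    return a == b or any(a in g and b in g for g in HOMOLOGOUS_GROUPS)


def column_conserved(column):
    """True if every residue in an alignment column is identical or if
    all residues fall within a single homologous group."""
    residues = [r.upper() for r in column]
    if len(set(residues)) == 1:
        return True
    return any(all(r in g for r in residues) for g in HOMOLOGOUS_GROUPS)
```

Note that F appears in two groups (ILVMF and FYW), so F is homologous to both I and W while I and W are not homologous to each other.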

Additional guidance on amino acid substitution is available from studies of protein
evolution. Go et al., Int. J. Peptide Protein Res. (1980) 15:211, classified amino acid residue
sites as interior or exterior depending on their accessibility. More frequent substitution on
exterior sites was confirmed to be general in eight sets of homologous protein families
regardless of their biological functions and the presence or absence of a prosthetic group.
Virtually all types of amino acid residues had higher mutabilities on the exterior than in the
interior. No correlation between mutability and polarity of amino acid residues was observed
in either the interior or the exterior. Amino acid residues were classified into
one of three groups depending on their polarity: polar (Arg, Lys, His, Gln, Asn, Asp, and
Glu); weak polar (Ala, Pro, Gly, Thr, and Ser), and nonpolar (Cys, Val, Met, Ile, Leu, Phe,
Tyr, and Trp). Amino acid replacements during protein evolution were very conservative:
88% and 76% of them in the interior or exterior, respectively, were within the same group of
the three. Inter-group replacements are such that weak polar residues are replaced more
often by nonpolar residues in the interior and more often by polar residues on the exterior.
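
As a small illustration of the three polarity groups, the fraction of replacements that stay within a group can be tallied as follows. The function name and the input format (a list of residue pairs in single-letter code) are assumptions made for the example; no data from the cited study are reproduced.

```python
# Polarity groups as given in the text, in single-letter code.
POLARITY = {
    "polar": "RKHQNDE",
    "weak polar": "APGTS",
    "nonpolar": "CVMILFYW",
}
GROUP_OF = {aa: name for name, aas in POLARITY.items() for aa in aas}


def fraction_within_group(substitutions):
    """Fraction of (original, replacement) pairs that stay in one polarity group."""
    same = sum(GROUP_OF[a.upper()] == GROUP_OF[b.upper()] for a, b in substitutions)
    return same / len(substitutions)
```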

Additional guidance for production of polypeptide variants is provided in Querol et
al., Prot. Eng. (1996) 9:265, which provides general rules for amino acid substitutions to
enhance protein thermostability. New glycosylation sites can be introduced as discussed in
Olsen and Thomsen, J. Gen. Microbiol. (1991) 137:579. An additional disulfide bridge can
be introduced, as discussed by Perry and Wetzel, Science (1984) 226:555; Pantoliano et al.,
Biochemistry (1987) 26:2077; Matsumura et al., Nature (1989) 342:291; Nishikawa et al.,
Protein Eng. (1990) 3:443; Takagi et al., J. Biol. Chem. (1990) 265:6874; Clarke et al.,
Biochemistry (1993) 32:4322; and Wakarchuk et al., Protein Eng. (1994) 7:1379. Metal
binding sites can be introduced, according to Toma et al., Biochemistry (1991) 30:97, and
Haezerbrouck et al., Protein Eng. (1993) 6:643. Substitutions with prolines in loops can be
made according to Masul et al., Appl. Env. Microbiol. (1994) 60:3579; and Hardy et al.,
FEBS Lett. 317:89.

Cysteine-depleted muteins are considered variants within the scope of the invention. 
These variants can be constructed according to methods disclosed in U.S. Patent No. 
4,959,314, which discloses substitution of cysteines with other amino acids, and methods
for assaying biological activity and effect of the substitution. Such methods are suitable for 
proteins according to this invention that have cysteine residues suitable for such 
substitutions, for example to eliminate disulfide bond formation. 

Variants also include fragments of the polypeptides disclosed herein, particularly 
biologically active fragments and/or fragments corresponding to functional domains. 
Fragments of interest will typically be at least about 10 aa to at least about 15 aa in length, 
usually at least about 50 aa in length, and can be as long as 300 aa in length or longer, but 
will usually not exceed about 1000 aa in length, where the fragment will have a stretch of
amino acids that is identical to a polypeptide encoded by a polynucleotide having a
sequence of any of SEQ ID NOS: 1-844, or a homolog thereof.

The protein variants described herein are encoded by polynucleotides that are within
the scope of the invention. The genetic code can be used to select the appropriate codons to
construct the corresponding variants.
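
As a minimal sketch of how the genetic code can be used to select codons for a variant, the following back-translates a protein sequence after a single substitution (here, replacing a cysteine). The codon picked for each residue is an arbitrary valid choice from the standard genetic code, and the function names and the three-residue example are hypothetical; codon usage preferences of a given host are not considered.

```python
# One valid codon per amino acid from the standard genetic code.
# Which synonymous codon is used for each residue is an illustrative choice.
CODON_FOR = {
    "A": "GCT", "R": "CGT", "N": "AAC", "D": "GAC", "C": "TGC",
    "Q": "CAG", "E": "GAG", "G": "GGC", "H": "CAC", "I": "ATC",
    "L": "CTG", "K": "AAG", "M": "ATG", "F": "TTC", "P": "CCC",
    "S": "AGC", "T": "ACC", "W": "TGG", "Y": "TAC", "V": "GTG",
}


def substitute(protein, position, new_residue):
    """Return the protein with the residue at a 1-based position replaced."""
    if not 1 <= position <= len(protein):
        raise ValueError("position outside the protein")
    return protein[:position - 1] + new_residue + protein[position:]


def back_translate(protein):
    """Return one DNA sequence that encodes the given protein."""
    return "".join(CODON_FOR[residue] for residue in protein.upper())
```

For example, `back_translate(substitute("MCS", 2, "S"))` gives a coding sequence for the cysteine-depleted variant `MSS`.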

VI. Computer-Related Embodiments 

In general, a library of polynucleotides is a collection of sequence information, which
information is provided in either biochemical form (e.g., as a collection of polynucleotide
molecules), or in electronic form (e.g., as a collection of polynucleotide sequences stored in a
computer-readable form, as in a computer system and/or as part of a computer program).
The sequence information of the polynucleotides can be used in a variety of ways, e.g., as a
resource for gene discovery, as a representation of sequences expressed in a selected cell
type (e.g., cell type markers), and/or as markers of a given disease or disease state. In
general, a disease marker is a representation of a gene product that is present in all cells
affected by disease either at an increased or decreased level relative to a normal cell (e.g., a
cell of the same or similar type that is not substantially affected by disease). For example, a
polynucleotide sequence in a library can be a polynucleotide that represents an mRNA,
polypeptide, or other gene product encoded by the polynucleotide, that is either
overexpressed or underexpressed in a breast ductal cell affected by cancer relative to a
normal (i.e., substantially disease-free) breast cell.

The nucleotide sequence information of the library can be embodied in any suitable
form, e.g., electronic or biochemical forms. For example, a library of sequence information
embodied in electronic form includes an accessible computer data file (or, in biochemical
form, a collection of nucleic acid molecules) that contains the representative nucleotide
sequences of genes that are differentially expressed (e.g., overexpressed or underexpressed)
as between, for example, i) a cancerous cell and a normal cell; ii) a cancerous cell and a
dysplastic cell; iii) a cancerous cell and a cell affected by a disease or condition other than
cancer; iv) a metastatic cancerous cell and a normal cell and/or non-metastatic cancerous
cell; v) a malignant cancerous cell and a non-malignant cancerous cell (or a normal cell)
and/or vi) a dysplastic cell relative to a normal cell. Other combinations and comparisons of
cells affected by various diseases or stages of disease will be readily apparent to the
ordinarily skilled artisan. Biochemical embodiments of the library include a collection of
nucleic acids that have the sequences of the genes in the library, where the nucleic acids can
correspond to the entire gene in the library or to a fragment thereof, as described in greater
detail below.

The polynucleotide libraries of the subject invention include sequence information of 
a plurality of polynucleotide sequences, where at least one of the polynucleotides has a 
sequence of any of SEQ ID NOS: 1-844. By plurality is meant at least 2, usually at least 3 
and can include up to all of SEQ ID NOS: 1-844. The length and number of polynucleotides 
in the library will vary with the nature of the library, e.g., if the library is an oligonucleotide 
array, a cDNA array, a computer database of the sequence information, etc. 

Where the library is an electronic library, the nucleic acid sequence information can 
be present in a variety of media. "Media" refers to a manufacture, other than an isolated 
nucleic acid molecule, that contains the sequence information of the present invention. Such 
a manufacture provides the genome sequence or a subset thereof in a form that can be 
examined by means not directly applicable to the sequence as it exists in a nucleic acid. For 
example, the nucleotide sequence of the present invention, e.g. the nucleic acid sequences of 
any of the polynucleotides of SEQ ID NOS: 1-844, can be recorded on computer readable 
media, e.g. any medium that can be read and accessed directly by a computer. Such media 
include, but are not limited to: magnetic storage media, such as a floppy disc, a hard disc 
storage medium, and a magnetic tape; optical storage media such as CD-ROM; electrical 
storage media such as RAM and ROM; and hybrids of these categories such as 
magnetic/optical storage media. One of skill in the art can readily appreciate how any of the
presently known computer readable mediums can be used to create a manufacture 
comprising a recording of the present sequence information. "Recorded" refers to a process 
for storing information on computer readable medium, using any such methods as known in 
the art. Any convenient data storage structure can be chosen, based on the means used to 
access the stored information. A variety of data processor programs and formats can be used 
for storage, e.g. word processing text file, database format, etc. In addition to the sequence 
information, electronic versions of the libraries of the invention can be provided in 
conjunction or connection with other computer-readable information and/or other types of
computer-readable files (e.g., searchable files, executable files, etc., including, but not limited
to, for example, search program software, etc.).
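
One convenient text-file layout for such an electronic library is the widely used FASTA format. The sketch below writes a small library to a file-like object and reads it back; the entry names and the short sequences are placeholders, not sequences of the Sequence Listing, and the function names are hypothetical.

```python
import io


def write_fasta(records, handle, width=60):
    """Write {name: sequence} records as FASTA text, wrapping sequence lines."""
    for name, seq in records.items():
        handle.write(f">{name}\n")
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")


def read_fasta(handle):
    """Read FASTA text back into a {name: sequence} dictionary."""
    records = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()
    return records


# Placeholder entries standing in for library members.
library = {"entry_1": "ACGTACGTTAGC", "entry_2": "GGGGCCCCAATT"}
buffer = io.StringIO()
write_fasta(library, buffer)
buffer.seek(0)
restored = read_fasta(buffer)
```

The same two functions work with a file opened on disk in place of the in-memory buffer.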

By providing the nucleotide sequence in computer readable form, the information can
be accessed for a variety of purposes. Computer software to access sequence information is
publicly available. For example, the BLAST (Altschul et al., supra.) and BLAZE (Brutlag
et al., Comp. Chem. (1993) 17:203) search algorithms on a Sybase system can be used to
identify open reading frames (ORFs) within the genome that contain homology to ORFs
from other organisms.
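
The following is a simplified sketch of locating candidate ORFs in a stored sequence before any homology search is run. It is not BLAST or BLAZE: it merely scans the three forward reading frames for an ATG followed in frame by a stop codon. The function name and the `min_codons` parameter are assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return (start, end) pairs of forward-strand ORFs.

    start is the 0-based index of the ATG; end is one past the stop codon.
    min_codons counts the coding codons, excluding the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)
```

The reverse strand could be covered by running the same scan on the reverse complement of the sequence.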

As used herein, "a computer-based system" refers to the hardware means, software
means, and data storage means used to analyze the nucleotide sequence information of the
present invention. The minimum hardware of the computer-based systems of the present
invention comprises a central processing unit (CPU), input means, output means, and data
storage means. A skilled artisan can readily appreciate that any one of the currently
available computer-based systems is suitable for use in the present invention. The data
storage means can comprise any manufacture comprising a recording of the present sequence
information as described above, or a memory access means that can access such a
manufacture.

"Search means" refers to one or more programs implemented on the computer-based
system, to compare a target sequence or target structural motif with the stored sequence
information. Search means are used to identify fragments or regions of the genome that
match a particular target sequence or target motif. A variety of algorithms are
publicly known and commercially available, e.g. MacPattern (EMBL), BLASTN and
BLASTX (NCBI). A "target sequence" can be any DNA or amino acid sequence of six or
more nucleotides or two or more amino acids, preferably from about 10 to 100 amino acids
or from about 30 to 300 nucleotide residues.
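
A minimal sketch of a search means, assuming the stored sequence information is held as a dictionary of named sequences, is given below. It checks the minimum target length stated above and reports exact matches only; it is a stand-in for illustration and not an implementation of MacPattern, BLASTN, or BLASTX. All names are hypothetical.

```python
MIN_TARGET_LENGTH = {"nucleotide": 6, "amino_acid": 2}


def is_valid_target(target, kind):
    """Check the minimum target length for the given kind of sequence."""
    return len(target) >= MIN_TARGET_LENGTH[kind]


def search_library(library, target):
    """Return {entry name: [0-based match positions]} for exact matches."""
    target = target.upper()
    hits = {}
    for name, seq in library.items():
        seq = seq.upper()
        positions = []
        pos = seq.find(target)
        while pos != -1:
            positions.append(pos)
            pos = seq.find(target, pos + 1)  # allow overlapping matches
        if positions:
            hits[name] = positions
    return hits
```

Entries with no match are simply absent from the returned dictionary.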

A "target structural motif," or "target motif," refers to any rationally selected
sequence or combination of sequences in which the sequence(s) are chosen based on a
three-dimensional configuration that is formed upon the folding of the target motif, or on
consensus sequences of regulatory or active sites. There are a variety of target motifs known
in the art. Protein target motifs include, but are not limited to, enzyme active sites and signal
sequences. Nucleic acid target motifs include, but are not limited to, hairpin structures,
promoter sequences and other expression elements such as binding sites for transcription
factors.

A variety of structural formats for the input and output means can be used to input 
and output the information in the computer-based systems of the present invention. One 
format for an output means ranks fragments of the genome possessing varying degrees of 
homology to a target sequence or target motif. Such presentation provides a skilled artisan 
with a ranking of sequences and identifies the degree of sequence similarity contained in the 
identified fragment. 
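
An output format of this kind can be sketched as follows, using the fraction of identical positions in the best ungapped window as a simple stand-in for the degree of homology. This scoring is an assumption chosen for brevity; publicly available homology search programs use gapped alignments and statistical scores instead. The function names are hypothetical.

```python
def best_identity(target, sequence):
    """Best fraction of identical positions over all ungapped windows."""
    target, sequence = target.upper(), sequence.upper()
    n = len(target)
    if n == 0 or len(sequence) < n:
        return 0.0
    best = 0
    for i in range(len(sequence) - n + 1):
        window = sequence[i:i + n]
        matches = sum(a == b for a, b in zip(target, window))
        best = max(best, matches)
    return best / n


def rank_hits(library, target):
    """Return (name, identity) pairs sorted from most to least similar."""
    scored = [(name, best_identity(target, seq)) for name, seq in library.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Each returned pair gives the entry name together with its similarity to the target, so the ranking and the degree of similarity are presented together.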

A variety of comparing means can be used to compare a target sequence or target 
motif with the data storage means to identify sequence fragments of the genome. A skilled 
artisan can readily recognize that any one of the publicly available homology search 
programs can be used as the search means for the computer based systems of the present 
invention. 

As discussed above, the "library" of the invention also encompasses biochemical
libraries of the polynucleotides of SEQ ID NOS: 1-844, e.g., collections of nucleic acids
representing the provided polynucleotides. The biochemical libraries can take a variety of
forms, e.g., a solution of cDNAs, a pattern of probe nucleic acids stably associated with a
surface of a solid support (i.e., an array) and the like. Of particular interest are nucleic acid
arrays in which one or more of SEQ ID NOS: 1-844 is represented on the array. By array is
meant an article of manufacture that has at least a substrate with at least two distinct
nucleic acid targets on one of its surfaces, where the number of distinct nucleic acids can be
considerably higher, typically being at least 10 nt, usually at least 20 nt and often at least 25
nt. A variety of different array formats have been developed and are known to those of skill
in the art, including those described in 5,242,974; 5,384,261; 5,405,783; 5,412,087;
5,424,186; 5,429,807; 5,436,327; 5,445,934; 5,472,672; 5,527,681; 5,529,756; 5,545,531;
5,554,501; 5,556,752; 5,561,071; 5,599,895; 5,624,711; 5,639,603; 5,658,734; WO
93/17126; WO 95/11995; WO 95/35505; EP 742287; and EP 799897. The arrays of the
subject invention find use in a variety of applications, including gene expression analysis,
drug screening, mutation analysis and the like, as disclosed in the above-listed exemplary
patent documents.

In addition to the above nucleic acid libraries, analogous libraries of polypeptides are
also provided, where the polypeptides of the library will represent at least a
portion of the polypeptides encoded by SEQ ID NOS: 1-844.

VII. Utilities

A. Use of Polynucleotide Probes in Mapping, and in Tissue Profiling 
Polynucleotide probes, generally comprising at least 12 contiguous nucleotides of a
polynucleotide as shown in the Sequence Listing, are used for a variety of purposes, such as
chromosome mapping of the polynucleotide and detection of transcription levels. Additional
disclosure about preferred regions of the disclosed polynucleotide sequences is found in the
Examples. A probe that hybridizes specifically to a polynucleotide disclosed herein should
provide a detection signal at least 5-, 10-, or 20-fold higher than the background
hybridization provided with other unrelated sequences.
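
The two numerical criteria above, a probe of at least 12 contiguous nucleotides and a signal at least 5-fold over background, can be sketched as follows. The function names are hypothetical, and the signal and background values are assumed to be measured in the same arbitrary units.

```python
def candidate_probes(sequence, length=12):
    """Yield every contiguous stretch of the given length from the sequence."""
    for i in range(len(sequence) - length + 1):
        yield sequence[i:i + length]


def is_specific(signal, background, min_fold=5.0):
    """True when the signal is at least min_fold times the background."""
    if background <= 0:
        raise ValueError("background must be positive")
    return signal / background >= min_fold
```

Passing `min_fold=10.0` or `min_fold=20.0` applies the more stringent thresholds mentioned in the text.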

Probes in Detection of Expression Levels. Nucleotide probes are used to detect
expression of a gene corresponding to the provided polynucleotide. The references describe
an example of a sandwich nucleotide hybridization assay. For example, in Northern blots,
mRNA is separated electrophoretically and contacted with a probe. A probe is detected as
hybridizing to an mRNA species of a particular size. The amount of hybridization is
quantitated to determine relative amounts of expression, for example under a particular
condition. Probes are also used to detect products of amplification by polymerase chain
reaction. The products of the reaction are hybridized to the probe and hybrids are detected.
Probes are used for in situ hybridization to cells to detect expression. Probes can also be
used in vivo for diagnostic detection of hybridizing sequences. Probes are typically labeled
with a radioactive isotope. Other types of detectable labels can be used such as
chromophores, fluors, and enzymes. Other examples of nucleotide hybridization assays are
described in WO92/02526 and U.S. Patent No. 5,124,246.

Alternatively, the Polymerase Chain Reaction (PCR) is another means for detecting
small amounts of target nucleic acids (see, e.g., Mullis et al., Meth. Enzymol. (1987)
155:335; U.S. Patent No. 4,683,195; and U.S. Patent No. 4,683,202). Two primer
polynucleotides hybridize with the target nucleic acids and are used to prime the
reaction. The primers can be composed of sequence within or 3' and 5' to the
polynucleotides of the Sequence Listing. Alternatively, if the primers are 3' and 5' to these
polynucleotides, they need not hybridize to them or the complements. A thermostable
polymerase creates copies of target nucleic acids from the primers using the original target
nucleic acids as a template. After a large amount of target nucleic acids is generated by the
polymerase, it is detected by methods such as Southern blots. When using the Southern blot
method, the labeled probe will hybridize to a polynucleotide of the Sequence Listing or
complement.

Furthermore, mRNA or cDNA can be detected by traditional blotting techniques 
described in Sambrook et al., "Molecular Cloning: A Laboratory Manual" (New York, Cold
Spring Harbor Laboratory, 1989). mRNA or cDNA generated from mRNA using a
polymerase enzyme can be purified and separated using gel electrophoresis. The nucleic
acids on the gel are then blotted onto a solid support, such as nitrocellulose. The solid
support is exposed to a labeled probe and then washed to remove any unhybridized probe.
Next, the duplexes containing the labeled probe are detected. Typically, the probe is labeled 
with radioactivity. 

Mapping. Polynucleotides of the present invention are used to identify a
chromosome on which the corresponding gene resides. Such mapping can be useful in
identifying the function of the polynucleotide-related gene by its proximity to other genes
with known function. Function can also be assigned to the polynucleotide-related gene when
particular syndromes or diseases map to the same chromosome. For example, use of
polynucleotide probes in identification and quantification of nucleic acid sequence
aberrations is described in U.S. Patent No. 5,783,387.

For example, fluorescence in situ hybridization (FISH) on normal metaphase spreads
facilitates comparative genomic hybridization to allow total genome assessment of changes
in relative copy number of DNA sequences. See Schwartz and Samad, Curr. Opin.
Biotechnol. (1994) 5:70; Kallioniemi et al., Sem. Cancer Biol. (1993) 4:41; Valdes et al.,
Methods in Molecular Biology (1997) 68:1, Boultwood, ed., Humana Press, Totowa, NJ.
Preparations of human metaphase chromosomes are prepared using standard cytogenetic
techniques from human primary tissues or cell lines. Nucleotide probes comprising at least
12 contiguous nucleotides selected from the nucleotide sequence shown in the Sequence
Listing are used to identify the corresponding chromosome. The nucleotide probes are
labeled, for example, with a radioactive, fluorescent, biotinylated, or chemiluminescent label,
and detected by well known methods appropriate for the particular label selected. Protocols
for hybridizing nucleotide probes to preparations of metaphase chromosomes are also well
known in the art. A nucleotide probe will hybridize specifically to nucleotide sequences in
the chromosome preparations that are complementary to the nucleotide sequence of the
probe.

Polynucleotides are mapped to particular chromosomes using, for example, radiation
hybrids, or chromosome-specific hybrid panels. See Leach et al., Advances in Genetics,
(1995) 33:63-99; Walter et al., Nature Genetics (1994) 7:22; Walter and Goodfellow, Trends
in Genetics (1992) 9:352. Panels for radiation hybrid mapping are available from Research
Genetics, Inc., Huntsville, Alabama, USA. Databases for markers using various panels are
available via the world wide web at http://shgc-www.stanford.edu; and
http://www-genome.wi.mit.edu/cgi-bin/contig/rhmapper.pl. The statistical program RHMAP can be used
to construct a map based on the data from radiation hybridization with a measure of the
relative likelihood of one order versus another. RHMAP is available via the world wide web
at http://www.sph.umich.edu/group/statgen/software.

In addition, commercial programs are available for identifying regions of 
chromosomes commonly associated with disease, such as cancer. Polynucleotides based on 
the polynucleotides of the invention can be used to probe these regions. For example, if 
through profile searching a provided polynucleotide is identified as corresponding to a gene
encoding a kinase, its ability to bind to a cancer-related chromosomal region will suggest its
role as a kinase in one or more stages of tumor cell development/growth. Although some
experimentation would be required to elucidate the role, the polynucleotide constitutes a new 
material for isolating a specific protein that has potential for developing a cancer diagnostic 
or therapeutic. 

Tissue Typing or Profiling. Expression of specific mRNA corresponding to the
provided polynucleotides can vary in different cell types and can be tissue-specific. This
variation of mRNA levels in different cell types can be exploited with nucleic acid probe
assays to determine tissue types. For example, PCR, branched DNA probe assays, or
blotting techniques utilizing nucleic acid probes substantially identical or complementary to
polynucleotides listed in the Sequence Listing can determine the presence or absence of the
corresponding cDNA or mRNA.

For example, a metastatic lesion is identified by its developmental organ or tissue
source by identifying the expression of a particular marker of that organ or tissue. If a
polynucleotide is expressed only in a specific tissue type, and a metastatic lesion is found to
express that polynucleotide, then the developmental source of the lesion has been identified.
Expression of a particular polynucleotide is assayed by detection of either the corresponding
mRNA or the protein product. Immunological methods, such as antibody staining, are used
to detect a particular protein product. Hybridization methods can be used to detect particular
mRNA species, including but not limited to in situ hybridization and Northern blotting.
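
The inference described above can be sketched as a lookup from tissue-specific markers to tissues. The marker names and tissue assignments used in the usage note are hypothetical placeholders; the sketch assumes each listed marker is expressed in only one tissue type.

```python
def infer_tissue_source(expressed, tissue_specific_markers):
    """Return the sorted tissues whose specific markers the lesion expresses.

    expressed: set of marker names detected in the lesion.
    tissue_specific_markers: {marker name: tissue in which it is expressed}.
    """
    return sorted({
        tissue_specific_markers[marker]
        for marker in expressed
        if marker in tissue_specific_markers
    })
```

With `{"marker_A": "colon", "marker_B": "breast"}` as the marker table, a lesion expressing `marker_A` and an unlisted `marker_X` yields `["colon"]`. An empty result means no tissue-specific marker was detected, and more than one tissue in the result would call for further markers.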

Use of Polymorphisms. A polynucleotide of the invention will be useful in forensics,
genetic analysis, mapping, and diagnostic applications if the corresponding region of a gene
is polymorphic in the human population. Particular polymorphic forms of the provided
polynucleotides can be used to either identify a sample as deriving from a suspect or rule out
the possibility that the sample derives from the suspect. Any means for detecting a
polymorphism in a gene are used, including but not limited to electrophoresis of protein
polymorphic variants, differential sensitivity to restriction enzyme cleavage, and
hybridization to allele-specific probes.
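
A minimal sketch of the comparison step, assuming each profile records the pair of alleles observed at each polymorphic locus, is given below. The locus and allele labels are hypothetical. A profile that is "not excluded" is only consistent with the suspect; weighing such a result would require population allele frequencies, which this sketch does not address.

```python
def compare_profiles(sample, suspect):
    """Compare allele calls at the loci typed in both profiles.

    Each profile maps a locus name to a pair of alleles. Returns "excluded"
    if any shared locus differs, "not excluded" if all shared loci agree,
    and "inconclusive" if no locus was typed in both profiles.
    """
    shared = sample.keys() & suspect.keys()
    if not shared:
        return "inconclusive"
    for locus in shared:
        # Allele order within a pair carries no meaning, so compare sorted.
        if sorted(sample[locus]) != sorted(suspect[locus]):
            return "excluded"
    return "not excluded"
```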

B. Antibody Production 

Expression products of a polynucleotide of the invention, the corresponding mRNA 
or cDNA, or the corresponding complete gene are prepared and used for raising antibodies 
for experimental, diagnostic, and therapeutic purposes. For polynucleotides to which a 
corresponding gene has not been assigned, this provides an additional method of identifying 
the corresponding gene. The polynucleotide or related cDNA is expressed as described 
above, and antibodies are prepared. These antibodies are specific to an epitope on the
polypeptide encoded by the polynucleotide, and can precipitate or bind to the corresponding 
native protein in a cell or tissue preparation or in a cell-free extract of an in vitro expression 
system. 

Immunogens for raising antibodies are prepared by mixing the polypeptides encoded 
by the polynucleotides of the present invention with adjuvants. Alternatively, polypeptides 
are made as fusion proteins to larger immunogenic proteins. Polypeptides are also 
covalently linked to other larger immunogenic proteins, such as keyhole limpet hemocyanin. 
Immunogens are typically administered intradermally, subcutaneously, or intramuscularly.
Immunogens are administered to experimental animals such as rabbits, sheep, and mice, to
generate antibodies. Optionally, the animal spleen cells are isolated and fused with myeloma
cells to form hybridomas which secrete monoclonal antibodies. Such methods are well
known in the art. According to another method known in the art, the selected polynucleotide
is administered directly, such as by intramuscular injection, and expressed in vivo. The
expressed protein generates a variety of protein-specific immune responses, including
production of antibodies, comparable to administration of the protein.

Preparations of polyclonal and monoclonal antibodies specific for polypeptides
encoded by a selected polynucleotide are made using standard methods known in the art.
The antibodies specifically bind to epitopes present in the polypeptides encoded by
polynucleotides disclosed in the Sequence Listing. Typically, at least 6, 8, 10, or 12
contiguous amino acids are required to form an epitope. However, epitopes which involve
non-contiguous amino acids may require more, for example at least 15, 25, or 50 amino
acids. A short sequence of a polynucleotide may then be unsuitable for use as an epitope to
raise antibodies for identifying the corresponding novel protein, because of the potential for
cross-reactivity with a known protein. However, the antibodies can be useful for other
purposes, particularly if they identify common structural features of a known protein and a
novel polypeptide encoded by a polynucleotide of the invention.
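
The cross-reactivity concern can be illustrated by enumerating contiguous peptides of a chosen length from a novel polypeptide and setting aside those that also occur in a known protein. This is a sketch under simplifying assumptions: it considers exact, contiguous matches only, and the one-letter sequences in the usage note are invented for illustration.

```python
def peptide_windows(protein, length):
    """Return every contiguous peptide of the given length."""
    return [protein[i:i + length] for i in range(len(protein) - length + 1)]


def unique_candidates(novel, known, length=8):
    """Return peptides of the novel protein that do not occur in the known one."""
    return [p for p in peptide_windows(novel, length) if p not in known]
```

For example, `unique_candidates("MKTAYIAK", "GGKTAYIAGG", 6)` returns `["MKTAYI", "TAYIAK"]`, dropping the shared peptide `KTAYIA`.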

Antibodies that specifically bind to human polypeptides encoded by the provided
polynucleotides should provide a detection signal at least 5-, 10-, or 20-fold higher than a
detection signal provided with other proteins when used in Western blots or other
immunochemical assays. Preferably, antibodies that specifically bind polypeptides of the
invention do not bind to other proteins in immunochemical assays at detectable levels and
can immunoprecipitate the specific polypeptide from solution.

To test for the presence of serum antibodies to the polypeptide of the invention in a
human population, human antibodies are purified by methods well known in the art.
Preferably, the antibodies are affinity purified by passing antiserum over a column to which
the corresponding selected polypeptide or fusion protein is bound. The bound antibodies can
then be eluted from the column, for example using a buffer with a high salt concentration.

In addition to the antibodies discussed above, genetically engineered antibody
derivatives are made, such as single chain antibodies, according to methods well known in
the art.

C. Use of Polynucleotides to Construct Arrays for Diagnostics
Polynucleotide arrays provide a high throughput technique that can assay a large 
number of polynucleotide sequences in a sample. This technology can be used as a 
diagnostic and as a tool to test for differential expression to determine function of an 
encoded protein. Arrays can be created by spotting polynucleotide probes onto a substrate 
(e.g., glass, nitrocellulose, etc.) in a two-dimensional matrix or array having bound probes.
The probes can be bound to the substrate by either covalent bonds or by non-specific
interactions, such as hydrophobic interactions. Samples of polynucleotides can be detectably
labeled (e.g., using radioactive or fluorescent labels) and then hybridized to the probes.
Double stranded polynucleotides, comprising the labeled sample polynucleotides bound to
probe polynucleotides, can be detected once the unbound portion of the sample is washed
away. Techniques for constructing arrays and methods of using these arrays are described in
EP No. 0 799 897; PCT No. WO 97/29212; PCT No. WO 97/27317; EP No. 0 785 280; PCT
No. WO 97/02357; U.S. Pat. No. 5,593,839; U.S. Pat. No. 5,578,832; EP No. 0 728 520;
U.S. Pat. No. 5,599,695; EP No. 0 721 016; U.S. Pat. No. 5,556,752; PCT No. WO
95/22058; and U.S. Pat. No. 5,631,734.

As discussed in some detail above, arrays can be used to examine differential
expression of genes and can be used to determine gene function. For example, arrays of the
instant polynucleotide sequences can be used to determine if any of the provided
polynucleotides are differentially expressed between a test cell and control cell (e.g., cancer
cells and normal cells). For example, high expression of a particular message in a cancer
cell, which is not observed in a corresponding normal cell, can indicate a cancer specific
protein. Exemplary uses of arrays are further described in, for example, Pappalarado et al.,
Sem. Radiation Oncol. (1998) 8:217; and Ramsay, Nature Biotechnol. (1998) 16:40.
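
A comparison of array signals between a test cell and a control cell can be sketched as a per-probe ratio. The 2-fold threshold and the signal floor below are assumptions chosen for illustration and are not taken from the text; the probe names in the usage note are placeholders.

```python
def differential_calls(test, control, min_fold=2.0, floor=1.0):
    """Label each shared probe as "over", "under", or "unchanged".

    test, control: {probe name: signal intensity}. Signals below the floor
    are raised to it so that the ratio is always defined.
    """
    calls = {}
    for probe in test.keys() & control.keys():
        ratio = max(test[probe], floor) / max(control[probe], floor)
        if ratio >= min_fold:
            calls[probe] = "over"
        elif ratio <= 1.0 / min_fold:
            calls[probe] = "under"
        else:
            calls[probe] = "unchanged"
    return calls
```

With test signals `{"p1": 800, "p2": 100, "p3": 50}` and control signals `{"p1": 100, "p2": 110, "p3": 400}`, probe `p1` is called "over", `p2` "unchanged", and `p3` "under".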

D. Differential Expression

The polynucleotides of the invention can also be used to detect differences in
expression levels between two cells, e.g., as a method to identify abnormal or diseased tissue
in a human. For polynucleotides corresponding to profiles of protein families as described
above, the choice of tissue can be selected according to the putative biological function. In
general, the expression of a gene corresponding to a specific polynucleotide is compared
between a first tissue that is suspected of being diseased and a second, normal tissue of the
human. The tissue suspected of being abnormal or diseased can be derived from a different
tissue type of the human, but preferably it is derived from the same tissue type; for example
an intestinal polyp or other abnormal growth should be compared with normal intestinal
tissue. The normal tissue can be the same tissue as that of the test sample, or any normal 
tissue of the patient, especially those that express the polynucleotide-related gene of interest 
(e.g., brain, thymus, testis, heart, prostate, placenta, spleen, small intestine, skeletal muscle, 
pancreas, and the mucosal lining of the colon). A difference between the polynucleotide- 
related gene, mRNA, or protein in the two tissues which are compared, for example in 
molecular weight, amino acid or nucleotide sequence, or relative abundance, indicates a 
change in the gene, or a gene which regulates it, in the tissue of the human that was 
suspected of being diseased. Examples of detection of differential expression and its use in 
diagnosis of cancer are described in U.S. Patent Nos. 5,688,641 and 5,677,125. 

The polynucleotide-related genes in the two tissues are compared by any means
known in the art. For example, the two genes can be sequenced, and the sequence of the
gene in the tissue suspected of being diseased compared with the gene sequence in the
normal tissue. The genes corresponding to a provided polynucleotide, or portions thereof, in
the two tissues are amplified, for example using nucleotide primers based on the nucleotide
sequence shown in the Sequence Listing, using the polymerase chain reaction. The
amplified genes or portions of genes are hybridized to detectably labeled nucleotide probes
selected from a nucleotide sequence shown in the Sequence Listing. A difference in the
nucleotide sequence of the isolated gene in the tissue suspected of being diseased compared
with the normal nucleotide sequence suggests a role of the gene product encoded by the
subject polynucleotide in the disease, and provides guidance for preparing a therapeutic
agent.
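
Once the two genes have been sequenced, the comparison itself can be sketched as a position-by-position check. The sketch assumes the two sequences are already aligned and of equal length, so insertions and deletions are out of scope; the function name is hypothetical.

```python
def sequence_differences(normal, test):
    """Return (1-based position, normal base, test base) for each mismatch."""
    if len(normal) != len(test):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (i + 1, a, b)
        for i, (a, b) in enumerate(zip(normal.upper(), test.upper()))
        if a != b
    ]
```

For example, `sequence_differences("ACGTAC", "ACGAAC")` returns `[(4, "T", "A")]`.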

Alternatively, mRNA corresponding to a provided polynucleotide in the two tissues
is compared. Poly A+ RNA is isolated from the two tissues as is known in the art. For
example, one of skill in the art can readily determine differences in the size or amount of
mRNA transcripts between the two tissues using Northern blots and detectably labeled
nucleotide probes selected from the nucleotide sequence shown in the Sequence Listing.
Increased or decreased expression of a given mRNA in a tissue sample suspected of being
diseased, compared with the expression of the same mRNA in a normal tissue, suggests that
the expressed protein has a role in the disease, and also provides a lead for preparing a
therapeutic agent.

The comparison can also be accomplished by analyzing polypeptides between the
matched samples. The sizes of the proteins in the two tissues are compared, for example,
using antibodies of the present invention to detect polypeptides in Western blots of protein
extracts from the two tissues. Other changes, such as expression levels and subcellular
localization, can also be detected immunologically, using antibodies to the corresponding
protein. A higher or lower level of expression of a given polypeptide in a tissue suspected of
being diseased, compared with the same protein expression level in a normal tissue, is
indicative that the expressed protein has a role in the disease, and provides guidance for
preparing a therapeutic agent.

Similarly, comparison of polynucleotide sequences or of gene expression products,
e.g., mRNA and protein, between a human tissue that is suspected of being diseased and a
normal tissue of a human, are used to follow disease progression or remission in the human.
Such comparisons are made as described above. For example, increased or decreased
expression of a gene corresponding to an inventive polynucleotide in the tissue suspected of
being neoplastic can indicate the presence of neoplastic cells in the tissue. The degree of
increased expression of a given gene in the neoplastic tissue relative to expression of the
same gene in normal tissue, or differences in the amount of increased expression of a given
gene in the neoplastic tissue over time, is used to assess the progression of the neoplasia in
that tissue or to monitor the response of the neoplastic tissue to a therapeutic protocol over
time.

The expression pattern of any two cell types can be compared, such as low and high
metastatic tumor cell lines, malignant or non-malignant cells, or cells from tissue which have
and have not been exposed to a therapeutic agent. A genetic predisposition to disease in a
human is detected by comparing expression levels of an mRNA or protein corresponding to
a polynucleotide of the invention in a fetal tissue with levels associated in normal fetal
tissue. Fetal tissues that are used for this purpose include, but are not limited to, amniotic
fluid, chorionic villi, blood, and the blastomere of an in vitro-fertilized embryo. The
comparable normal polynucleotide-related gene is obtained from any tissue. The mRNA or
protein is obtained from a normal tissue of a human in which the polynucleotide-related gene
is expressed. Differences such as alterations in the nucleotide sequence or size of the same
product of the fetal polynucleotide-related gene or mRNA, or alterations in the molecular
weight, amino acid sequence, or relative abundance of fetal protein, can indicate a germline
mutation in the polynucleotide-related gene of the fetus, which indicates a genetic
predisposition to disease. Particular diagnostic and prognostic uses of the disclosed
polynucleotides are described in more detail below.

E. Diagnostic, Prognostic, and Other Uses Based On Differential Expression

In general, diagnostic methods of the invention involve detection of a level or
amount of a gene product, particularly a differentially expressed gene product, in a test
sample obtained from a patient suspected of having or being susceptible to a disease (e.g.,
breast cancer, lung cancer, colon cancer and/or metastatic forms thereof), and comparing the
detected levels to those levels found in normal cells (e.g., cells substantially unaffected by
cancer) and/or other control cells (e.g., to differentiate a cancerous cell from a cell affected
by dysplasia). Furthermore, the severity of the disease can be assessed by comparing the
detected levels of a differentially expressed gene product with those levels detected in
samples representing the levels of differentially expressed gene product associated with varying
degrees of severity of disease.

The term "differentially expressed gene" is intended to encompass a polynucleotide
that can, for example, include an open reading frame encoding a gene product (e.g., a
polypeptide), and/or introns of such genes and adjacent 5' and 3' non-coding nucleotide
sequences involved in the regulation of expression, up to about 20 kb beyond the coding
region, but possibly further in either direction. The gene can be introduced into an
appropriate vector for extrachromosomal maintenance or for integration into a host genome.
In general, a difference in expression level associated with a decrease in expression level of
at least about 25%, usually at least about 50% to 75%, more usually at least about 90% or
more is indicative of a differentially expressed gene of interest, i.e., a gene that is
underexpressed or down-regulated in the test sample relative to a control sample.
Furthermore, a difference in expression level associated with an increase in expression of at
least about 25%, usually at least about 50% to 75%, more usually at least about 90% and can
be at least about 1½-fold, usually at least about 2-fold to about 10-fold, and can be about
100-fold to about 1,000-fold increase relative to a control sample is indicative of a
differentially expressed gene of interest, i.e., an overexpressed or up-regulated gene.
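
By way of non-limiting illustration, the percentage thresholds described above can be sketched in Python. The function name `classify_expression`, the single `min_change` cut-off (set here to the 25% floor), and the use of a plain fractional change are assumptions made for illustration only and are not part of the disclosure.

```python
def classify_expression(test_level, control_level, min_change=0.25):
    """Label a gene product as under-, over-, or not differentially expressed.

    min_change=0.25 mirrors the "at least about 25%" floor in the text.
    The hard cut-off and the function name are illustrative assumptions.
    """
    if control_level <= 0:
        raise ValueError("control_level must be positive")
    # Fractional change of the test sample relative to the control sample.
    change = (test_level - control_level) / control_level
    if change <= -min_change:
        return "underexpressed"
    if change >= min_change:
        return "overexpressed"
    return "unchanged"
```

Under these assumptions, a test level of 50 against a control level of 100 is a 50% decrease and is labeled underexpressed, while a test level of 110 falls below the 25% floor and is labeled unchanged.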

"Differentially expressed polynucleotide" as used herein means a nucleic acid 

molecule (RNA or DNA) having a sequence that represents a differentially expressed gene, 

e.g., the differentially expressed polynucleotide comprises a sequence (e.g., an open reading 

frame encoding a gene product) that uniquely identifies a differentially expressed gene so 

that detection of the differentially expressed polynucleotide in a sample is correlated with the 

presence of a differentially expressed gene in a sample. "Differentially expressed 

polynucleotides" is also meant to encompass fragments of the disclosed polynucleotides, 

e.g., fragments retaining biological activity, as well as nucleic acids homologous, 

substantially similar, or substantially identical (e.g., having about 90% sequence identity) to 

the disclosed polynucleotides. 

Methods of the subject invention useful in diagnosis or prognosis typically involve
comparison of the abundance of a selected differentially expressed gene product in a sample
of interest with that of a control to determine any relative differences in the expression of the
gene product, where the difference can be measured qualitatively and/or quantitatively.
Quantitation can be accomplished, for example, by comparing the level of expression
product detected in the sample with the amounts of product present in a standard curve. A
comparison can be made visually; by using a technique such as densitometry, with or
without computerized assistance; by preparing a representative library of cDNA clones of
mRNA isolated from a test sample, sequencing the clones in the library to determine the
number of cDNA clones corresponding to the same gene product, and analyzing the number
of clones corresponding to that same gene product relative to the number of clones of the
same gene product in a control sample; or by using an array to detect relative levels of
hybridization to a selected sequence or set of sequences, and comparing the hybridization
pattern to that of a control. The differences in expression are then correlated with the
presence or absence of an abnormal expression pattern. A variety of different methods for
determining the nucleic acid abundance in a sample are known to those of skill in the art,
where particular methods of interest include those described in: Pietu et al., Genome Res.
(1996) 6:492; Zhao et al., Gene (1995) 156:207; Soares, Curr. Opin. Biotechnol. (1997) 8:542;
Raval, J. Pharmacol. Toxicol. Methods (1994) 52:125; Chalifour et al., Anal. Biochem.
(1994) 216:299; Stolz et al., Mol. Biotechnol. (1996) 6:225; Hong et al., Biosci. Reports
(1982) 2:907; and McGraw, Anal. Biochem. (1984) 143:29Z. Also of interest are the
methods disclosed in WO 97/27317, the disclosure of which is herein incorporated by
reference.
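
As one hedged illustration of quantitation against a standard curve, the following Python sketch interpolates an amount of product from a measured signal. The function name `amount_from_standard_curve`, the (amount, signal) pair representation, and the use of piecewise-linear interpolation are assumptions; the text does not prescribe any particular curve-fitting method.

```python
def amount_from_standard_curve(signal, standards):
    """Estimate an amount of product from a measured signal.

    standards is a list of (known_amount, measured_signal) pairs.
    Assumes the signal rises monotonically with the amount and uses
    straight-line interpolation between neighboring standards.
    """
    points = sorted(standards, key=lambda p: p[1])
    for (a0, s0), (a1, s1) in zip(points, points[1:]):
        if s0 <= signal <= s1:
            if s1 == s0:
                return a0
            return a0 + (a1 - a0) * (signal - s0) / (s1 - s0)
    raise ValueError("signal lies outside the range of the standard curve")
```

A signal that falls outside the range of the standards raises an error rather than being extrapolated, which is a deliberately conservative design choice for this sketch.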

In general, diagnostic assays of the invention involve detection of a gene product of
the polynucleotide sequence (e.g., mRNA or polypeptide) that corresponds to a sequence of
SEQ ID NOS: 1-844. The patient from whom the sample is obtained can be apparently
healthy, susceptible to disease (e.g., as determined by family history or exposure to certain
environmental factors), or can already be identified as having a condition in which altered 
expression of a gene product of the invention is implicated. 

In the assays of the invention, the diagnosis can be determined based on detected 
expression levels of a gene product encoded by at least one, preferably at least
two or more, at least 3 or more, or at least 4 or more of the polynucleotides having a
sequence set forth in SEQ ID NOS: 1-844, and can involve detection of expression of genes
corresponding to all of SEQ ID NOS: 1-844 and/or additional sequences that can serve as 
additional diagnostic markers and/or reference sequences. Where the diagnostic method is 
designed to detect the presence of cancer in a patient, or a patient's susceptibility to cancer, the assay preferably
involves detection of a gene product encoded by a gene corresponding to a polynucleotide
that is differentially expressed in cancer. For example, a higher level of expression of a 
polynucleotide corresponding to SEQ ID NO:52 relative to a level associated with a normal 
sample can indicate the presence of cancer in the patient from whom the sample is derived. 
In another example, detection of a lower level of a polynucleotide corresponding to SEQ ID
NO:39 relative to a normal level is indicative of the presence of cancer in the patient.
Further examples of such differentially expressed polynucleotides are described in the 
Examples below. Given the provided polynucleotides and information regarding their 
relative expression levels provided herein, assays using such polynucleotides and detection 
of their expression levels in diagnosis and prognosis will be readily apparent to the ordinarily 

skilled artisan.

Any of a variety of detectable labels can be used in connection with the various
embodiments of the diagnostic methods of the invention. Suitable detectable labels include
fluorochromes (e.g., fluorescein isothiocyanate (FITC), rhodamine, Texas Red,
phycoerythrin, allophycocyanin, 6-carboxyfluorescein (6-FAM),
2',7'-dimethoxy-4',5'-dichloro-6-carboxyfluorescein (JOE), 6-carboxy-X-rhodamine (ROX),
6-carboxy-2',4',7',4,7-hexachlorofluorescein (HEX), 5-carboxyfluorescein (5-FAM) or
N,N,N',N'-tetramethyl-6-carboxyrhodamine (TAMRA)), radioactive labels (e.g., ³²P, ³⁵S, ³H, etc.), and
the like. The detectable label can involve a two-stage system (e.g., biotin-avidin,
hapten-anti-hapten antibody, etc.).

Reagents specific for the polynucleotides and polypeptides of the invention, such as
antibodies and nucleotide probes, can be supplied in a kit for detecting the presence of an
expression product in a biological sample. The kit can also contain buffers or labeling 
components, as well as instructions for using the reagents to detect and quantify expression 
products in the biological sample. Exemplary embodiments of the diagnostic methods of the 
invention are described below in more detail.

Polypeptide detection in diagnosis. In one embodiment, the test sample is assayed 
for the level of a differentially expressed polypeptide. Diagnosis can be accomplished using 
any of a number of methods to determine the absence or presence or altered amounts of the 
differentially expressed polypeptide in the test sample. For example, detection can utilize 
staining of cells or histological sections with labeled antibodies, performed in accordance
with conventional methods. Cells can be permeabilized to stain cytoplasmic molecules. In 
general, antibodies that specifically bind a differentially expressed polypeptide of the 
invention are added to a sample, and incubated for a period of time sufficient to allow 
binding to the epitope, usually at least about 10 minutes. The antibody can be detectably
labeled for direct detection (e.g., using radioisotopes, enzymes, fluorescers,
chemiluminescers, and the like), or can be used in conjunction with a second stage antibody
or reagent to detect binding (e.g., biotin with horseradish peroxidase-conjugated avidin, a 
secondary antibody conjugated to a fluorescent compound, e.g. fluorescein, rhodamine, 
Texas red, etc.). The absence or presence of antibody binding can be determined by various 
methods, including flow cytometry of dissociated cells, microscopy, radiography,
scintillation counting, etc. Any suitable alternative method of qualitative or quantitative
detection of levels or amounts of differentially expressed polypeptide can be used, for
example ELISA, western blot, immunoprecipitation, radioimmunoassay, etc.

In general, the detected level of differentially expressed polypeptide in the test 
sample is compared to a level of the differentially expressed gene product in a reference or 
control sample, e.g., in a normal cell (negative control) or in a cell having a known disease 
state (positive control). For example, a higher level of expression of a polypeptide encoded 
by SEQ ID NO:52 relative to a level associated with a normal sample can indicate the 
presence of cancer in the patient from whom the sample is derived. In another example, 
detection of a lower level of the polypeptide encoded by SEQ ID NO:39 relative to a normal 
level is indicative of the presence of cancer in the patient. 

mRNA detection. The diagnostic methods of the invention can also or alternatively 
involve detection of mRNA encoded by a gene corresponding to a differentially expressed 
polynucleotide of the invention. Any suitable qualitative or quantitative methods known in
the art for detecting specific mRNAs can be used. mRNA can be detected by, for example, 
in situ hybridization in tissue sections, by reverse transcriptase-PCR, or in Northern blots 
containing poly A+ mRNA. One of skill in the art can readily use these methods to 
determine differences in the size or amount of mRNA transcripts between two samples. For 
example, the level of mRNA of the invention in a tissue sample suspected of being 
cancerous or dysplastic is compared with the expression of the mRNA in a reference sample, 
e.g., a positive or negative control sample (e.g., normal tissue, cancerous tissue, etc.). In a 
specific non-limiting example, a higher level of mRNA corresponding to SEQ ID NO:52 
relative to a level associated with a normal sample can indicate the presence of cancer in the 
patient from whom the sample is derived. In another example, detection of a lower level of
mRNA corresponding to SEQ ID NO:39 relative to a normal level is indicative of the
presence of cancer in the patient.

Any suitable method for detecting and comparing mRNA expression levels in a 
sample can be used in connection with the diagnostic methods of the invention (see, e.g., 
U.S. 5,804,382). For example, mRNA expression levels in a sample can be determined by 
generation of a library of expressed sequence tags (ESTs) from the sample, where the EST 
library is representative of sequences present in the sample (Adams et al., Science (1991)
252:1651). Enumeration of the relative representation of ESTs within the library can be used
to approximate the relative representation of the gene transcript within the starting sample.
The results of EST analysis of a test sample can then be compared to EST analysis of a
reference sample to determine the relative expression levels of a selected polynucleotide, 
particularly a polynucleotide corresponding to one or more of the differentially expressed 
genes described herein. 
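
The EST enumeration described above can be sketched as follows. This is a minimal illustration that assumes each sequenced EST has already been assigned to a gene identifier; the function names and the list-of-identifiers input format are hypothetical.

```python
from collections import Counter


def relative_representation(est_gene_ids):
    """Return the fraction of ESTs in a library assigned to each gene."""
    counts = Counter(est_gene_ids)
    total = sum(counts.values())
    return {gene: n / total for gene, n in counts.items()}


def expression_ratio(test_ests, reference_ests, gene):
    """Ratio of a gene's representation in the test library to the reference.

    Returns None when the gene is absent from the reference library,
    because the ratio is then undefined.
    """
    test = relative_representation(test_ests).get(gene, 0.0)
    reference = relative_representation(reference_ests).get(gene, 0.0)
    if reference == 0.0:
        return None
    return test / reference
```

For example, a gene accounting for three of four ESTs in the test library and one of four in the reference library yields a ratio of 3, suggesting higher relative expression in the test sample under these assumptions.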

Alternatively, gene expression in a test sample can be analyzed using serial analysis
of gene expression (SAGE) methodology (Velculescu et al., Science (1995) 270:484). In
short, SAGE involves the isolation of short unique sequence tags from a specific location
within each transcript (e.g., a sequence of any one of SEQ ID NOS:1-6). The sequence tags
are concatenated, cloned, and sequenced. The frequency of particular transcripts within the
starting sample is reflected by the number of times the associated sequence tag is
encountered within the sequence population.
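
A minimal sketch of the tag-counting step is given below. It assumes, as in commonly described SAGE protocols but not stated in the text above, that tags are joined tail to tail in pairs (ditags) separated by an anchoring-enzyme site. The constants `ANCHOR` and `TAG_LEN` and the function names are illustrative assumptions, and tags that themselves contain the anchor site are not handled.

```python
from collections import Counter

ANCHOR = "CATG"   # assumed anchoring-enzyme site separating ditags
TAG_LEN = 10      # assumed tag length in bases

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_sage_tags(concatemer):
    """Count tag occurrences in one sequenced concatemer.

    Each ditag contributes its first TAG_LEN bases as one tag and the
    reverse complement of its last TAG_LEN bases as a second tag.
    """
    counts = Counter()
    for ditag in concatemer.split(ANCHOR):
        if len(ditag) < 2 * TAG_LEN:
            continue  # skip empty or truncated pieces
        counts[ditag[:TAG_LEN]] += 1
        counts[reverse_complement(ditag[-TAG_LEN:])] += 1
    return counts
```

Summing the counters from all sequenced clones gives the tag frequencies from which relative transcript abundance is estimated.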

Gene expression in a test sample can also be analyzed using differential display (DD) 
methodology. In DD, fragments defined by specific sequence delimiters (e.g., restriction 
enzyme sites) are used as unique identifiers of genes, coupled with information about
fragment length or fragment location within the expressed gene. The relative representation
of an expressed gene within a sample can then be estimated based on the relative
representation of the fragment associated with that gene within the pool of all possible
fragments. Methods and compositions for carrying out DD are well known in the art, see,
e.g., U.S. 5,776,683; and U.S. 5,807,680.

Alternatively, gene expression in a sample can be analyzed using hybridization analysis, which is
based on the specificity of nucleotide interactions. Oligonucleotides or cDNA can be used to 
selectively identify or capture DNA or RNA of specific sequence composition, and the 
amount of RNA or cDNA hybridized to a known capture sequence determined qualitatively 
or quantitatively, to provide information about the relative representation of a particular
message within the pool of cellular messages in a sample. Hybridization analysis can be 
designed to allow for concurrent screening of the relative expression of hundreds to 
thousands of genes by using, for example, array-based technologies having high density 
formats, including filters, microscope slides, or microchips, or solution-based technologies 
that use spectroscopic analysis (e.g., mass spectrometry). One exemplary use of arrays in the
diagnostic methods of the invention is described below in more detail.

Use of a single gene in diagnostic applications. The diagnostic methods of the
invention can focus on the expression of a single differentially expressed gene. For example,
the diagnostic method can involve detecting a differentially expressed gene, or a
polymorphism of such a gene (e.g., a polymorphism in a coding region or control region),
that is associated with disease. Disease-associated polymorphisms can include deletion or
truncation of the gene, mutations that alter expression level and/or affect activity of the
encoded protein, etc.

Changes in the promoter or enhancer sequence that affect expression levels of a
differentially expressed gene can be compared to expression levels of the normal allele by various
methods known in the art. Methods for determining promoter or enhancer strength include
quantitation of the expressed natural protein; insertion of the variant control element into a
vector with a reporter gene such as β-galactosidase, luciferase, chloramphenicol
acetyltransferase, etc. that provides for convenient quantitation; and the like.

A number of methods are available for analyzing nucleic acids for the presence of a 

specific sequence, e.g. a disease associated polymorphism. Where large amounts of DNA 

are available, genomic DNA is used directly. Alternatively, the region of interest is cloned 

into a suitable vector and grown in sufficient quantity for analysis. Cells that express a 

differentially expressed gene can be used as a source of mRNA, which can be assayed 

directly or reverse transcribed into cDNA for analysis. The nucleic acid can be amplified by 

conventional techniques, such as the polymerase chain reaction (PCR), to provide sufficient 

amounts for analysis, and a detectable label can be included in the amplification reaction 

(e.g., using a detectably labeled primer or detectably labeled oligonucleotides) to facilitate 

detection. The use of the polymerase chain reaction is described in Saiki, et al., Science 

(1985) 250:487, and a review of techniques can be found in Sambrook, et al., Molecular 

Cloning: A Laboratory Manual, (1989) pp. 14.2. Alternatively, various methods are known 

in the art that utilize oligonucleotide ligation as a means of detecting polymorphisms, for
examples see Riley et al., Nucl. Acids Res. (1990) 75:2887; and Delahunty et al., Am. J.
Hum. Genet. (1996) 55:1239.

The sample nucleic acid, e.g. amplified or cloned fragment, is analyzed by one of a 

number of methods known in the art. The nucleic acid can be sequenced by dideoxy or other 

methods, and the sequence of bases compared to a selected sequence, e.g., to a wild-type
sequence. Hybridization with the polymorphic or variant sequence can also be used to

determine its presence in a sample (e.g., by Southern blot, dot blot, etc.). The hybridization 

pattern of a polymorphic or variant sequence and a control sequence to an array of 

oligonucleotide probes immobilized on a solid support, as described in US 5,445,934, or in 

WO 95/35505, can also be used as a means of identifying polymorphic or variant sequences 

associated with disease. Single strand conformational polymorphism (SSCP) analysis, 

denaturing gradient gel electrophoresis (DGGE), and heteroduplex analysis in gel matrices 

are used to detect conformational changes created by DNA sequence variation as alterations 

in electrophoretic mobility. Alternatively, where a polymorphism creates or destroys a 

recognition site for a restriction endonuclease, the sample is digested with that endonuclease, 

and the products size fractionated to determine whether the fragment was digested. 

Fractionation is performed by gel or capillary electrophoresis, particularly acrylamide or 

agarose gels. 
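
The restriction-site approach described above can be illustrated with a short in silico digest. This is only a sketch: the function name `fragment_lengths`, the `cut_offset` parameter, and the toy sequences are assumptions, and the EcoRI recognition site GAATTC (cut after the first G) is used merely as a familiar example.

```python
def fragment_lengths(seq, site, cut_offset=0):
    """Return fragment lengths after cutting seq at every occurrence of site.

    cut_offset is the position of the cut within the recognition site,
    e.g. 1 for an enzyme that cuts after the first base of the site.
    """
    lengths = []
    start = 0
    pos = seq.find(site)
    while pos != -1:
        cut = pos + cut_offset
        lengths.append(cut - start)
        start = cut
        pos = seq.find(site, pos + 1)
    lengths.append(len(seq) - start)
    return lengths
```

A sequence carrying the site yields two or more fragments, whereas a variant in which the polymorphism has destroyed the site remains a single full-length fragment, which is the difference that size fractionation would reveal.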

Screening for mutations in a differentially expressed gene can be based on the
functional or antigenic characteristics of the protein. Protein truncation assays are useful in 
detecting deletions that can affect the biological activity of the protein. Various 
immunoassays designed to detect polymorphisms in proteins can be used in screening. 
Where many diverse genetic mutations lead to a particular disease phenotype, functional 
protein assays have proven to be effective screening tools. The activity of the encoded 
protein can be determined by comparison with the wild-type protein. 

Pattern matching in diagnosis using arrays. In another embodiment, the diagnostic 
and/or prognostic methods of the invention involve detection of expression of a selected set 
of genes in a test sample to produce a test expression pattern (TEP). The TEP is compared to 
a reference expression pattern (REP), which is generated by detection of expression of the 
selected set of genes in a reference sample (e.g., a positive or negative control sample). The
selected set of genes includes at least one of the genes of the invention, which genes
correspond to the polynucleotide sequences of SEQ ID NOS: 1-844. Of particular interest is
a selected set of genes that includes genes differentially expressed in the disease for which the
test sample is to be screened.

"Reference sequences" or "reference polynucleotides" as used herein in the context of
differential gene expression analysis and diagnosis/prognosis refers to a selected set of
polynucleotides, which selected set includes at least one or more of the differentially

expressed polynucleotides described herein. A plurality of reference sequences, preferably 

comprising positive and negative control sequences, can be included as reference sequences. 

Additional suitable reference sequences are found in Genbank, Unigene, and other 

nucleotide sequence databases (including, e.g., expressed sequence tag (EST), partial, and 

full-length sequences). 

"Reference array" means an array having reference sequences for use in hybridization 

with a sample, where the reference sequences include all, at least one of, or any subset of the 

differentially expressed polynucleotides described herein. Usually such an array will include 

at least 3 different reference sequences, and can include any one or all of the provided 

differentially expressed sequences. Arrays of interest can further comprise sequences, 

including polymorphisms, of other genetic sequences, particularly other sequences of interest 

for screening for a disease or disorder (e.g., cancer, dysplasia, or other related or unrelated

diseases, disorders, or conditions). The oligonucleotide sequence on the array will usually 

be at least about 12 nt in length, and can be of about the length of the provided sequences, or 

can extend into the flanking regions to generate fragments of 100 nt to 200 nt in length or

more. 

A "reference expression pattern" or "REP" as used herein refers to the relative levels 
of expression of a selected set of genes, particularly of differentially expressed genes, that is 
associated with a selected cell type, e.g., a normal cell, a cancerous cell, a cell exposed to an
environmental stimulus, and the like. A "test expression pattern" or "TEP" refers to relative 
levels of expression of a selected set of genes, particularly of differentially expressed genes, 
in a test sample (e.g., a cell of unknown or suspected disease state, from which mRNA is 
isolated). 

"Diagnosis" as used herein generally includes determination of a subject's 
susceptibility to a disease or disorder, determination as to whether a subject is presently 
affected by a disease or disorder, as well as to the prognosis of a subject affected by a disease 
or disorder (e.g., identification of pre-metastatic or metastatic cancerous states, stages of 
cancer, or responsiveness of cancer to therapy). The present invention particularly 
encompasses diagnosis of subjects in the context of breast cancer (e.g., carcinoma in situ 
(e.g., ductal carcinoma in situ), estrogen receptor (ER)-positive breast cancer, ER-negative
breast cancer, or other forms and/or stages of breast cancer), lung cancer (e.g., small cell
carcinoma, non-small cell carcinoma, mesothelioma, and other forms and/or stages of lung
cancer), and colon cancer (e.g., adenomatous polyp, colorectal carcinoma, and other forms
and/or stages of colon cancer).

"Sample" or "biological sample" as used throughout herein are generally meant to refer
to samples of biological fluids or tissues, particularly samples obtained from tissues,
especially from cells of the type associated with the disease for which the diagnostic
application is designed (e.g., ductal adenocarcinoma), and the like. "Samples" is also meant
to encompass derivatives and fractions of such samples (e.g., cell lysates). Where the sample
is solid tissue, the cells of the tissue can be dissociated or tissue sections can be analyzed. 

REPs can be generated in a variety of ways according to methods well known in the 
art. For example, REPs can be generated by hybridizing a control sample to an array having 
a selected set of polynucleotides (particularly a selected set of differentially expressed 
polynucleotides), acquiring the hybridization data from the array, and storing the data in a 
format that allows for ready comparison of the REP with a TEP. Alternatively, all expressed
sequences in a control sample can be isolated and sequenced, e.g., by isolating mRNA from 
a control sample, converting the mRNA into cDNA, and sequencing the cDNA. The 
resulting sequence information roughly or precisely reflects the identity and relative number 
of expressed sequences in the sample. The sequence information can then be stored in a 
format (e.g., a computer-readable format) that allows for ready comparison of the REP with
a TEP. The REP can be normalized prior to or after data storage, and/or can be processed to
selectively remove sequences of expressed genes that are of less interest or that might
complicate analysis (e.g., some or all of the sequences associated with housekeeping genes
can be eliminated from REP data). 
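
The normalization and filtering steps just described might be sketched as follows. The function name `normalize_pattern`, the dictionary representation of an expression pattern, normalization to the summed signal, and the example gene symbol ACTB (used here as a stand-in housekeeping gene) are all illustrative assumptions rather than features of the disclosed methods.

```python
def normalize_pattern(raw_signals, exclude=()):
    """Drop excluded genes, then scale the remaining signals to sum to 1.

    raw_signals maps a gene identifier to its raw signal.
    exclude lists genes to remove first (e.g., housekeeping genes).
    """
    kept = {g: v for g, v in raw_signals.items() if g not in exclude}
    total = sum(kept.values())
    if total <= 0:
        raise ValueError("no positive signal left to normalize")
    return {g: v / total for g, v in kept.items()}
```

Applying the same function to test and reference data places a TEP and a REP on a common scale before they are compared.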
TEPs can be generated in a manner similar to REPs, e.g., by hybridizing a test sample
to an array having a selected set of polynucleotides, particularly a selected set of
differentially expressed polynucleotides, acquiring the hybridization data from the array, and 
storing the data in a format that allows for ready comparison of the TEP with a REP. The 
REP and TEP to be used in a comparison can be generated simultaneously, or the TEP can 
be compared to previously generated and stored REPs.

In one embodiment of the invention, comparison of a TEP with a REP involves
hybridizing a test sample with a reference array, where the reference array has one or more
reference sequences for use in hybridization with a sample. The reference sequences include
all, at least one of, or any subset of the differentially expressed polynucleotides described
herein. Hybridization data for the test sample is acquired, the data normalized, and the
produced TEP compared with a REP generated using an array having the same or similar
selected set of differentially expressed polynucleotides. Probes that correspond to sequences
differentially expressed between the two samples will show decreased or increased
hybridization efficiency for one of the samples relative to the other.

Reference arrays can be produced according to any suitable methods known in the
art. For example, methods of producing large arrays of oligonucleotides are described in
U.S. 5,134,854, and U.S. 5,445,934 using light-directed synthesis techniques. Using a
computer controlled system, a heterogeneous array of monomers is converted, through
simultaneous coupling at a number of reaction sites, into a heterogeneous array of polymers.
Alternatively, microarrays are generated by deposition of pre-synthesized oligonucleotides
onto a solid substrate, for example as described in PCT published application no. 
WO 95/35505. 

Methods for collection of data from hybridization of samples with a reference array
are also well known in the art. For example, the polynucleotides of the reference and test
samples can be generated using a detectable fluorescent label, and hybridization of the
polynucleotides in the samples detected by scanning the microarrays for the presence of the
detectable label. Methods and devices for detecting fluorescently marked targets on devices
are known in the art. Generally, such detection devices include a microscope and light
source for directing light at a substrate. A photon counter detects fluorescence from the
substrate, while an x-y translation stage varies the location of the substrate. A confocal
detection device that can be used in the subject methods is described in U.S. Patent no.
5,631,734. A scanning laser microscope is described in Shalon et al., Genome Res. (1996)
5:639. A scan, using the appropriate excitation line, is performed for each fluorophore used.
The digital images generated from the scan are then combined for subsequent analysis. For
any particular array element, the fluorescent signal from one sample (e.g., a test
sample) is compared to the fluorescent signal from another sample (e.g., a reference sample),
and the relative signal intensity determined.
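
The per-element comparison of fluorescent signals can be sketched as a log ratio. The function name `log2_ratios`, the dictionary inputs keyed by array element, and the `floor` value used to avoid dividing by near-zero background are assumptions for illustration; the text does not specify how the relative intensity is computed.

```python
import math


def log2_ratios(test_signal, reference_signal, floor=1.0):
    """Return log2(test / reference) for every array element.

    Signals below floor are raised to floor so that weak or zero
    background readings do not produce undefined ratios.
    """
    ratios = {}
    for element, test_value in test_signal.items():
        ref_value = reference_signal[element]
        ratios[element] = math.log2(max(test_value, floor) / max(ref_value, floor))
    return ratios
```

On this scale a value of 2 corresponds to a four-fold stronger signal in the test sample and a value of -2 to a four-fold weaker signal, so increases and decreases are treated symmetrically.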

Methods for analyzing the data collected from hybridization to arrays are well known 
in the art. For example, where detection of hybridization involves a fluorescent label, data 
analysis can include the steps of determining fluorescent intensity as a function of substrate 
position from the data collected, removing outliers, i.e. data deviating from a predetermined 
statistical distribution, and calculating the relative binding affinity of the targets from the 
remaining data. The resulting data can be displayed as an image with the intensity in each 
region varying according to the binding affinity between targets and probes. 
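
As a hedged sketch of the outlier-removal step, the following Python function discards pixel intensities that deviate strongly from the rest of an array element before averaging. The median-absolute-deviation rule, the `max_dev` cut-off, and the function name are assumptions; the text requires only that outliers be defined against a predetermined statistical distribution, and the averaged intensity is used here as a simple proxy for relative binding.

```python
import statistics


def robust_intensity(pixels, max_dev=3.0):
    """Average the pixel intensities of one array element after outlier removal.

    A pixel is treated as an outlier when it lies more than max_dev
    median absolute deviations (MAD) from the median intensity.
    """
    if not pixels:
        raise ValueError("pixels must not be empty")
    med = statistics.median(pixels)
    mad = statistics.median([abs(p - med) for p in pixels])
    if mad == 0:
        return med  # no measurable spread around the median
    kept = [p for p in pixels if abs(p - med) <= max_dev * mad]
    return statistics.mean(kept)
```

In the example `robust_intensity([100, 102, 98, 101, 99, 1000])`, the single saturated pixel is excluded and the remaining five values are averaged.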

In general, the test sample is classified as having a gene expression profile 
corresponding to that associated with a disease or non-disease state by comparing the TEP 
generated from the test sample to one or more REPs generated from reference samples (e.g., 
from samples associated with cancer or specific stages of cancer, dysplasia, samples affected 
by a disease other than cancer, normal samples, etc.). The criteria for a match or a 
substantial match between a TEP and a REP include expression of the same or substantially 
the same set of reference genes, as well as expression of these reference genes at 
substantially the same levels (e.g., no significant difference between the samples for a signal 
associated with a selected reference sequence after normalization of the samples, or at least 
no greater than about 25% to about 40% difference in signal strength for a given reference 
sequence). In general, a pattern match between a TEP and a REP includes a match in
expression, preferably a match in qualitative or quantitative expression level, of at least one
of, all of, or any subset of the differentially expressed genes of the invention.
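
The match criteria above can be sketched as a simple comparison of two normalized patterns. The function names, the dictionary representation, and the default `max_diff` of 0.25 (the lower end of the "about 25% to about 40%" range) are illustrative assumptions and not a definitive implementation.

```python
def patterns_match(tep, rep, max_diff=0.25):
    """Return True when a TEP substantially matches a REP.

    Both patterns map a reference gene to its normalized signal. They
    match when they cover the same genes and every signal differs from
    the reference by no more than max_diff, as a fraction of the
    reference signal.
    """
    if set(tep) != set(rep):
        return False
    for gene, ref_level in rep.items():
        test_level = tep[gene]
        if ref_level == 0:
            if test_level != 0:
                return False
            continue
        if abs(test_level - ref_level) / ref_level > max_diff:
            return False
    return True


def matching_states(tep, reps, max_diff=0.25):
    """Return the names of all stored REPs that the TEP matches."""
    return [name for name, rep in reps.items() if patterns_match(tep, rep, max_diff)]
```

A test sample would then be assigned the disease or non-disease state of whichever stored REP it matches, with no match or several matches flagged for manual review.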

Pattern matching can be performed manually, or can be performed using a computer 
program. Methods for preparation of substrate matrices (e.g., arrays), design of
oligonucleotides for use with such matrices, labeling of probes, hybridization conditions, 
scanning of hybridized matrices, and analysis of patterns generated, including comparison 
analysis, are described in, for example, U.S. 5,800,992. 

F. Use of the Polynucleotides of the Invention in Cancer
Oncogenesis involves the unbridled growth, differentiation and abnormal 
migration of cells. Cancerous cells can have the ability to compress, invade, and destroy 
normal tissue. Cancerous cells may also metastasize to other parts of the body via the 
bloodstream or the lymph system and colonize in these other areas. Different cancers are
classified by the cell from which the cancerous cell is derived and from its cellular
morphology and/or state of differentiation.

Somatic genetic abnormalities cause cancer initiation and progression. Cancer 
generally is clonally formed, i.e. gain of function of oncogenes and loss of function of tumor
suppressor genes within a single cell transform the cell to be cancerous, and that single cell
grows and divides to form a cancerous lesion. The genes known to be involved in cancer
initiation and progression are involved in numerous cellular functions, including
developmental differentiation, cell cycle regulation, cell signaling, immunological response,
DNA replication, and DNA repair.

The identification and characterization of genetic or biochemical markers in blood or 
tissues that will detect the earliest changes along the carcinogenesis pathway and monitor the 
efficacy of various therapies and preventive interventions is a major goal of cancer research. 
Scientists have identified genetic changes in stool specimens that indicate the stages of colon 
cancer, and other biomarkers such as gene mutations, hormone receptors, proteins that
inhibit metastasis, and enzymes that metabolize drugs are all being used to determine the 
severity and predict the course of breast, prostate, lung, and other cancers. 

Recent advances in understanding the pathogenesis of certain cancers have been helpful in
determining patient treatment. The level of expression of certain polynucleotides can be
indicative of a poorer prognosis, and therefore warrant more aggressive chemo- or
radiotherapy for a patient. The correlation of novel surrogate tumor specific features with
response to treatment and outcome in patients has defined certain prognostic indicators that
allow the design of tailored therapy based on the molecular profile of the tumor. These
therapies include antibody targeting and gene therapy. Moreover, a promising level of one
or more marker polynucleotides can provide impetus for not aggressively treating a
particular patient, thus sparing the patient the deleterious side effects of aggressive therapy.
Determining expression of certain polynucleotides and comparison of a patient's profile with
known expression in normal tissue and variants of the disease allows a determination of the
best possible treatment for a patient, both in terms of specificity of treatment and in terms of
comfort level of the patient.

Surrogate tumor markers, such as polynucleotide expression, can also be used to
better classify, and thus diagnose and treat, different forms and disease states of cancer.
Two classifications widely used in oncology that can benefit from identification of the 
expression levels of the polynucleotides of the invention are staging of the cancerous 
disorder, and grading the nature of the cancerous tissue. 

Staging. Staging is a process used by physicians to describe how advanced the 
cancerous state is in a patient. Staging assists the physician in determining a prognosis, 
planning treatment and evaluating the results of such treatment. Different staging systems 
are used for different types of cancer, but each generally involves the following 
determinations: the type of tumor, indicated by T; whether the cancer has metastasized to 
nearby lymph nodes, indicated by N; and whether the cancer has metastasized to more 
distant parts of the body, indicated by M. This system of staging is called the TNM system. 
Generally, if a cancer is only detectable in the area of the primary lesion without having 
spread to any lymph nodes it is called Stage I. If it has spread only to the closest lymph
nodes, it is called Stage II. In Stage III, the cancer has generally spread to the lymph nodes
in near proximity to the site of the primary lesion. Cancers that have spread to a distant part 
of the body, such as the liver, bone, brain or another site, are called Stage IV, the most 
advanced stage. 

Currently, the determination of staging is done using pathological techniques and is 
based more on the presence or absence of malignant tissue rather than the characteristics of 
the tumor type. Presence or absence of malignant tissue is based primarily on the gross 
morphology of the cells in the areas biopsied. The polynucleotides of the invention can 
facilitate fine-tuning of the staging process by identifying markers for the aggressivity of a
cancer, e.g. the metastatic potential, as well as the presence in different areas of the body.
Thus, detection in a Stage II cancer of a polynucleotide signifying high metastatic potential
can be used to change a borderline Stage II tumor to a Stage III tumor, justifying more
aggressive therapy. Conversely, the presence of a polynucleotide signifying a lower 
metastatic potential allows more conservative staging of a tumor. 

Grading of cancers. Grade is a term used to describe how closely a tumor resembles 
normal tissue of its same type. Based on the microscopic appearance of a tumor, 
pathologists will identify the grade of a tumor based on parameters such as cell morphology, 
cellular organization, and other markers of differentiation. As a general rule, the grade of a
tumor corresponds to its rate of growth or aggressiveness. That is, undifferentiated or
high-grade tumors grow more quickly than well differentiated or low-grade tumors. Information
about tumor grade is useful in planning treatment and predicting prognosis.

The American Joint Committee on Cancer has recommended the following
guidelines for grading tumors: 1) GX Grade cannot be assessed; 2) G1 Well differentiated;
3) G2 Moderately well differentiated; 4) G3 Poorly differentiated; 5) G4 Undifferentiated.
Although grading is used by pathologists to describe most cancers, it plays a more important 
role in treatment planning for certain types than for others. An example is the Gleason 
system that is specific for prostate cancer, which uses grade numbers to describe the degree
of differentiation. Lower Gleason scores indicate well-differentiated cells. Intermediate 
scores denote tumors with moderately differentiated cells. Higher scores describe poorly 
differentiated cells. Grade is also important in some types of brain tumors and soft tissue 
sarcomas. 

The polynucleotides of the invention can be especially valuable in determining the
grade of the tumor, as they not only can aid in determining the differentiation status of the
cells of a tumor, they can also identify factors other than differentiation that are valuable in 
determining the aggressivity of a tumor, such as metastatic potential. 

Familial Cancer Genes. A number of cancer syndromes are linked to Mendelian
inheritance of a predisposition to develop particular cancers. The following table contains a
list of cancer types that can be inherited, and for which the gene or genes responsible have 
been identified. Most of the cancer types listed can occur as part of several different genetic 
conditions, each caused by alterations in a different gene. 



Cancer Type                             Genetic Condition                                   Gene

Brain                                   Li-Fraumeni syndrome                                TP53
                                        Neurofibromatosis 1                                 NF1
                                        Neurofibromatosis 2                                 NF2
                                        von Hippel-Lindau syndrome                          VHL
                                        Tuberous sclerosis 2                                TSC2

Breast                                  Hereditary breast/ovarian cancer 1                  BRCA1
                                        Hereditary breast/ovarian cancer 2                  BRCA2
                                        Li-Fraumeni syndrome                                TP53
                                        Ataxia telangiectasia                               ATM

Colon                                   Familial adenomatous polyposis (FAP)                APC
                                        Hereditary non-polyposis colon cancer (HNPCC) 1     hMSH2
                                        Hereditary non-polyposis colon cancer (HNPCC) 2     hMLH1
                                        Hereditary non-polyposis colon cancer (HNPCC) 3     hPMS1
                                        Hereditary non-polyposis colon cancer (HNPCC) 4     hPMS2

Endocrine                               Multiple endocrine neoplasia 1 (MEN1)               MEN1
(parathyroid, pituitary, GI endocrine)

Endocrine                               Multiple endocrine neoplasia 2 (MEN2)               RET
(pheochromocytoma, medullary thyroid)

Endometrial                             Hereditary non-polyposis colon cancer (HNPCC) 1     hMSH2
                                        Hereditary non-polyposis colon cancer (HNPCC) 2     hMLH1
                                        Hereditary non-polyposis colon cancer (HNPCC) 3     hPMS1
                                        Hereditary non-polyposis colon cancer (HNPCC) 4     hPMS2

Eye                                     Hereditary retinoblastoma                           RB1

Hematologic                             Li-Fraumeni syndrome                                TP53
(lymphomas and leukemia)                Ataxia telangiectasia                               ATM

Kidney                                  Hereditary Wilms' tumor                             WT1
                                        von Hippel-Lindau syndrome                          VHL
                                        Tuberous sclerosis 2                                TSC2

Ovary                                   Hereditary breast/ovarian cancer 1                  BRCA1
                                        Hereditary breast/ovarian cancer 2                  BRCA2

Sarcoma                                 Hereditary retinoblastoma                           RB1
                                        Li-Fraumeni syndrome                                TP53
                                        Neurofibromatosis 1                                 NF1

Skin                                    Hereditary melanoma 1                               CDKN2
                                        Hereditary melanoma 2                               CDK4
                                        Basal cell naevus (Gorlin) syndrome                 PTCH

Stomach                                 Hereditary non-polyposis colon cancer (HNPCC) 1     hMSH2
                                        Hereditary non-polyposis colon cancer (HNPCC) 2     hMLH1
                                        Hereditary non-polyposis colon cancer (HNPCC) 3     hPMS1
                                        Hereditary non-polyposis colon cancer (HNPCC) 4     hPMS2



The polynucleotides of the invention can be especially useful to monitor patients having any
of the above syndromes to detect potentially malignant events at a molecular level before
they are detectable at a gross morphological level. As can be seen from the table, a number
of genes are involved in multiple forms of cancer. Thus, a polynucleotide of the invention
identified as important for metastatic colon cancer can also have clinical implications for a 
patient diagnosed with stomach cancer or endometrial cancer. 

Lung Cancer. Lung cancer is one of the most common cancers in the United States, 
accounting for about 15 percent of all cancer cases, or 170,000 new cases each year. At this 
time, over half of the lung cancer cases in the United States are in men, but the number found
in women is increasing and will soon equal that in men. Today more women die of lung 
cancer than of breast cancer. Lung cancer is especially difficult to diagnose and treat 
because of the large size of the lungs, which allows cancer to develop for years undetected. 
In fact, lung cancer can spread outside the lungs without causing any symptoms. Adding to
the confusion, the most common symptom of lung cancer, a persistent cough, can often be
mistaken for a cold or bronchitis.

Although there are more than a dozen different kinds of lung cancer, the two main 
types of lung cancer are small cell and nonsmall cell, which encompass about 90% of all
lung cancer cases. Small cell carcinoma (also called oat cell carcinoma), which usually starts
in one of the larger bronchial tubes, grows fairly rapidly, and is likely to be large by the time
of diagnosis. Nonsmall cell lung cancer (NSCLC) is made up of three general subtypes of 
lung cancer. Epidermoid carcinoma (also called squamous cell carcinoma) usually starts in 
one of the larger bronchial tubes and grows relatively slowly. The size of these tumors can 
range from very small to quite large. Adenocarcinoma starts growing near the outside 
surface of the lung and can vary in both size and growth rate. Some slowly growing 
adenocarcinomas are described as alveolar cell cancer. Large cell carcinoma starts near the 
surface of the lung, grows rapidly, and the growth is usually fairly large when diagnosed. 
Other less common forms of lung cancer are carcinoid, cylindroma, mucoepidermoid, and
malignant mesothelioma. 

Currently, CT scans, MRIs, X-rays, sputum cytology, and biopsies are used to 
diagnose nonsmall cell lung cancer. The form and cellular origin of the lung cancer is
diagnosed primarily through biopsy, either a surgical biopsy or a needle aspiration of
lung tissue, and usually the biopsy is prompted by an abnormality identified on an X-ray.
In some cases, sputum cytology can reveal lung cancers in patients with normal X-rays or 
can determine the type of lung cancer, but because it cannot pinpoint the tumor's location, a 
positive sputum cytology test is usually followed by further tests. Since these tests are based 
in large part on gross morphology of the tissue, the diagnosis of a particular kind of tumor is 
largely subjective, and the diagnosis can vary significantly between clinicians. 

The polynucleotides of the invention can be used to distinguish types of lung cancer
as well as to identify traits specific to a certain patient's cancer. For example, if the patient's
biopsy expresses a polynucleotide that is associated with a low metastatic potential, it may 
justify leaving a larger portion of the patient's lung in surgery to remove the lesion. 
Alternatively, a smaller lesion with expression of a polynucleotide that is associated with 
high metastatic potential may justify a more radical removal of lung tissue and/or the 
surrounding lymph nodes, even if no metastasis can be identified through pathological
examination.

Similarly, the expression of polynucleotides of the invention can be used in the
diagnosis, prognosis and management of lung cancer. The differential expression of a
polynucleotide in hyperplasia can be used as a diagnostic marker for metastatic lung cancer.
The polynucleotides of the invention that would be especially useful for this purpose are
those that exhibit differential expression between high metastatic versus low metastatic lung
cancer, i.e. SEQ ID NOS: 9, 34, 42, 62, 74, 106, 119, 135, 154, 160, 260, 308, 323, 349,
361, 369, 371, 381, 395, and 400. Detection of malignant lung cancer with a higher
metastatic potential can be determined using expression levels of any of these sequences
alone or in combination with the levels of expression of other known genes.

Breast Cancer. The National Cancer Institute (NCI) estimates that about 1 in 8
women in the United States will develop breast cancer during their lifetimes. Clinical breast
examination and mammography are recommended as combined modalities for breast cancer
screening, and the nature of the cancer will often depend upon the location of the tumor and
the cell type from which the tumor is derived. The majority of breast cancers are
adenocarcinoma subtypes, which can be summarized as follows:

Ductal carcinoma in situ (DCIS): Ductal carcinoma in situ is the most common type 
of noninvasive breast cancer. In DCIS, the malignant cells have not metastasized through 
the walls of the ducts into the fatty tissue of the breast. Comedocarcinoma is a type of DCIS
that is more likely than other types of DCIS to come back in the same area after 
lumpectomy. It is more closely linked to eventual development of invasive ductal carcinoma 
than other forms of DCIS. 

Infiltrating (or invasive) ductal carcinoma (IDC): this type of cancer has metastasized 
through the wall of the duct and invaded the fatty tissue of the breast. At this point, it has the
potential to use the lymphatic system and bloodstream for metastasis to more distant parts of 
the body. Infiltrating ductal carcinoma accounts for about 80% of breast cancers. 

Lobular carcinoma in situ (LCIS): While not a true cancer, LCIS (also called lobular 
neoplasia) is sometimes classified as a type of noninvasive breast cancer. It does not 
penetrate through the wall of the lobules. Although it does not itself usually become an
invasive cancer, women with this condition have a higher risk of developing an invasive
breast cancer in the same breast, or in the opposite breast.

Infiltrating (or invasive) lobular carcinoma (ILC): ILC is similar to IDC, in that it has 
the potential to metastasize elsewhere in the body. About 10% to 15% of invasive breast
cancers are invasive lobular carcinomas. ILC can be more difficult to detect by
mammogram than IDC. 

Inflammatory breast cancer: This rare type of invasive breast cancer accounts for 
about 1% of all breast cancers and is extremely aggressive. Multiple skin symptoms 
associated with this cancer are caused by cancer cells blocking lymph vessels or channels in 
the skin over the breast.

Medullary carcinoma: This special type of infiltrating breast cancer has a relatively 
well defined, distinct boundary between tumor tissue and normal tissue. It accounts for 
about 5% of breast cancers. The prognosis for this kind of breast cancer is better than for 
other types of invasive breast cancer.

Mucinous carcinoma: This rare type of invasive breast cancer originates from mucus-producing
cells. The prognosis for mucinous carcinoma is better than for the more common
types of invasive breast cancer. 

Paget's disease of the nipple: This type of breast cancer starts in the ducts and spreads
to the skin of the nipple and the areola. It is a rare type of breast cancer, occurring in only
1% of all cases. Paget's disease can be associated with in situ carcinoma, or with infiltrating
breast carcinoma. If no lump can be felt in the breast tissue, and the biopsy shows DCIS but
no invasive cancer, the prognosis is excellent.

Phyllodes tumor: This very rare type of breast tumor forms from the stroma of the 
breast, in contrast to carcinomas which develop in the ducts or lobules. Phyllodes (also
spelled phylloides) tumors are usually benign, but are malignant on rare occasions.
Nevertheless, malignant phyllodes tumors are very rare and less than 10 women per year in
the US die of this disease. Benign phyllodes tumors are successfully treated by removing the 
mass and a narrow margin of normal breast tissue. 

Tubular carcinoma: Accounting for about 2% of all breast cancers, tubular 
carcinomas are a special type of infiltrating breast carcinoma. They have a better prognosis
than usual infiltrating ductal or lobular carcinomas.

High-quality mammography combined with clinical breast exam remains the only
screening method clearly tied to reduction in breast cancer mortality. Lower dose x-rays,
digitized computer rather than film images, and the use of computer programs to assist
diagnosis, are almost ready for widespread dissemination. Other technologies also are being
developed, including magnetic resonance imaging and ultrasound. In addition, a very low
radiation exposure technique, positron emission tomography, has the potential for detecting
early breast cancer.

It is also possible to differentiate between non-cancerous breast tissue and malignant 
breast tissue by analyzing differential gene expression between tissues. In addition, there 
may be several possible alterations that lead to the various possible types of breast cancer. 
The different types of breast tumors (e.g., invasive vs. non-invasive, ductal vs. axillary 
lymph node) can be differentiable from one another by the identification of the differences in 
genes expressed by different types of breast tumor tissues (Porter-Jordan et al., Hematol
Oncol Clin North Am (1994) 5:73). Breast cancer can thus be generally diagnosed by
detection of expression of a gene or genes associated with breast tumors. Where enough
information is available about the differential gene expression between various types of 
breast tumor tissues, the specific type of breast tumor can also be diagnosed. 

For example, increased estrogen receptor (ER) expression in normal breast 
epithelium, while not itself indicative of malignant tissue, is a known risk marker for
development of breast cancer. Khan SA et al., Cancer Res (1994) 54:993. Malignant breast
cancer is often divided into two groups, ER-positive and ER-negative, based on the estrogen
receptor status of the tissue. The ER status represents different survival length and response
to hormone therapy, and is thought to represent either 1) an indicator of different stages of
the disease, or 2) an indicator that allows differentiation between two similar but distinct
diseases. K. Zhu et al., Med Hypoth. (1997) 49:69. A number of other genes are known to
vary in expression between either different stages of cancer or different types of similar breast
cancer. 

Similarly, the expression of polynucleotides of the invention can be used in the 
diagnosis and management of breast cancer. The differential expression of a polynucleotide 
in human breast tumor tissue can be used as a diagnostic marker for human breast cancer.
The polynucleotides of the invention that would be especially useful for this purpose are 
those that exhibit differential expression between breast cancer tissue with a high metastatic
potential and a low metastatic potential, i.e. SEQ ID NOS: 9, 42, 52, 62, 65, 66, 68, 114,
123, 144, 172, 178, 214, 219, 223, 258, 317, and 379. Detection of breast cancer can be
determined using expression levels of any of these sequences alone or in combination.
Determination of the aggressive nature and/or the metastatic potential of a breast cancer can
also be determined by comparing levels of one or more polynucleotides of the invention with
levels of another sequence known to vary in cancerous tissue, e.g. ER expression.
In addition, development of breast cancer can be detected by examining the ratio of SEQ ID
NO: to the levels of steroid hormones (e.g., testosterone or estrogen) or to other hormones
(e.g., growth hormone, insulin). Thus expression of specific marker polynucleotides can
be used to discriminate between normal and cancerous breast tissue, to discriminate between
breast cancers with different cells of origin, to discriminate between breast cancers with
different potential metastatic rates, etc.

Diagnosis of breast cancer can also involve comparing the expression of a
polynucleotide of the invention with the expression of other sequences in non-malignant
breast tissue samples in comparison to one or more forms of the diseased tissue. A
comparison of expression of one or more polynucleotides of the invention between the
samples provides information on relative levels of these polynucleotides as well as the ratio
of these polynucleotides to the expression of other sequences in the tissue of interest
compared to normal.

The risk of breast cancer is elevated significantly by the presence of an inherited risk
for breast cancer, such as a mutation in BRCA-1 or BRCA-2. New diagnostic tools are
being developed to address the needs of higher risk patients to complement mammography
and physical examinations for early detection of breast cancer, particularly among younger
women. The presence of antigen or expression markers in nipple aspirate fluid (NAF)
samples collected from one or both breasts can be useful for risk assessment or
early cancer detection. Breast cytology and biomarkers obtained by random fine needle
aspiration have been used to identify hyperplasia with atypia and overexpression of p53 and
EGFR. The polynucleotides of the invention can be used in multivariate analysis with
expression studies with genes such as p53 and EGFR as risk predictors and as surrogate
endpoint biomarkers for breast cancer.

As well as being used for diagnosis and risk assessment, the expression of certain
genes can also be correlated to prognosis of a disease state. The expression of particular genes
has been used as a prognostic indicator for breast cancer, including increased expression of
c-erbB-2, pS2, ER, progesterone receptor, epidermal growth factor receptor (EGFR), neu,
myc, bcl-2, int2, cytosolic tyrosine kinase, cyclin E, prad-1, hst, uPA, PAI-1, PAI-2,
cathepsin D, as well as the presence of a number of cancer-specific antigens, e.g. CEA, CA
M26, CA M29 and CA 15.3. Davis, Br. J. Biomed Sci. (1996) 55:157. Poor prognosis has
also been linked to a decrease in expression of certain genes, such as p53, Rb, nm23. The
expression of the polynucleotides of the invention can be of prognostic value for determining
the metastatic potential of a malignant breast cancer, as these molecules are differentially
expressed between high and low metastatic potential tumor tissues. The levels of these
polynucleotides in patients with malignant breast cancer can be compared to normal tissue,
malignant tissue with a known high potential metastatic level, and malignant tissue with a
known lower level of metastatic potential to provide a prognosis for a particular patient.
Such a prognosis is predictive of the extent and nature of the cancer. The determined
prognosis is useful in managing a patient with breast cancer, both for
initial treatment of the disease and for longer-term monitoring of the same patient. If
samples are taken from the same individual over a period of time, differences in
polynucleotide expression that are specific to that patient can be identified and closely
watched.

Colon Cancer. Colorectal cancer is one of the most common neoplasms in humans 
and perhaps the most frequent form of hereditary neoplasia. Prevention and early detection 
are key factors in controlling and curing colorectal cancer. Indeed, colorectal cancer is the 
second most preventable cancer, after lung cancer. Colorectal cancer begins as polyps,
which are small, benign growths of cells that form on the inner lining of the colon. Over a
period of several years, some of these polyps accumulate additional mutations and become
cancerous. About 20 percent of all cases of colon cancer are thought to be related to 
heredity. Currently, multiple familial colorectal cancer disorders have been identified, which 
are summarized as follows: 

Familial adenomatous polyposis (FAP): This condition results in a person having 
hundreds or even thousands of polyps in the colon and rectum that usually first appear during 
the teenage years. Cancer nearly always develops in one or more of these polyps between
the ages of 30 and 50.

Gardner's syndrome: Like FAP, Gardner's syndrome results in polyps and colorectal
cancers that develop at a young age. It can also cause benign tumors of the skin, soft
connective tissue and bones.

Hereditary nonpolyposis colon cancer (HNPCC): People with this condition tend to
develop colorectal cancer at a young age, without first having many polyps. HNPCC has an
autosomal dominant pattern of inheritance with variable but high penetrance estimated to be
about 90%. HNPCC underlies 0.5%-10% of all cases of colorectal cancer. An understanding
of the mechanisms behind the development of HNPCC is emerging, and genetic
presymptomatic testing, now being conducted in research settings, soon will be available on
a widespread basis for individuals identified at risk for this disease.

Familial colorectal cancer in Ashkenazi Jews: Recent research has found an inherited
tendency to develop colorectal cancer among some Jews of Eastern European descent.
Like people with FAP, Gardner's syndrome, and HNPCC, their increased risk is due to an
inherited mutation present in about 6% of American Jews.

Several tests are currently used to screen for colorectal cancer, including digital rectal
examination, fecal occult blood test, sigmoidoscopy, colonoscopy, virtual colonoscopy and
MRI. Each of these tests identifies potential colorectal cancer lesions, or a risk of
development of these lesions, at a fairly gross morphological level.

The sequential alteration of a number of genes is associated with malignant
adenocarcinoma, including the genes DCC, p53, ras, and FAP. For a review, see e.g. Fearon
ER, et al., Cell (1990) 61(5):759; Hamilton SR et al., Cancer (1993) 72:957; Bodmer W, et
al., Nat Genet (1994) 4(3):217; Fearon ER, Ann N Y Acad Sci (1995) 768:101. Molecular
genetic alterations are thus promising as potential diagnostic and prognostic indicators in
colorectal carcinoma, since it is possible to
differentiate between different types of colorectal neoplasias using molecular markers.
Colorectal cancer can thus be generally diagnosed by detection of expression of a gene or
genes associated with colorectal tumors.

Similarly, the expression of polynucleotides of the invention can be used in the
diagnosis, prognosis and management of colorectal cancer. The differential expression of a
polynucleotide in hyperplasia can be used as a diagnostic marker for colon cancer. The
polynucleotides of the invention that would be especially useful for this purpose are those
that exhibit differential expression between malignant metastatic colon cancer and normal
patient tissue, i.e. SEQ ID NOS: 52, 119, 172, 288. Detection of malignant colon cancer
can be determined using expression levels of any of these sequences alone or in combination
with the levels of expression of other known genes.

Determination of the aggressive nature and/or the metastatic potential of a colon
cancer can also be determined by comparing levels of one or more polynucleotides of the
invention with total levels of another sequence known to vary in cancerous tissue,
e.g. p53 expression. In addition, development of colon cancer can be detected by examining
the ratio of any of the polynucleotides of the invention to the levels of oncogenes (e.g. ras)
or tumor suppressor genes (e.g. FAP or p53). Thus expression of specific marker
polynucleotides can be used to discriminate between normal and cancerous colon tissue, to
discriminate between colon cancers with different cells of origin, to discriminate between
colon cancers with different potential metastatic rates, etc.

G. Use of Polynucleotides to Screen for Peptide Analogs and Antagonists
Polypeptides encoded by the instant polynucleotides and corresponding full length 
genes can be used to screen peptide libraries to identify binding partners, such as receptors, 
from among the encoded polypeptides. 
A library of peptides can be synthesized following the methods disclosed in U.S. Pat. 
No. 5,010,175 ('175), and in WO 91/17823. As described below in brief, one prepares a 
mixture of peptides, which is then screened to identify the peptides exhibiting the desired 
signal transduction and receptor binding activity. In the '175 method, a suitable peptide 
synthesis support (e.g., a resin) is coupled to a mixture of appropriately protected, activated 
amino acids. The concentration of each amino acid in the reaction mixture is balanced or 
adjusted in inverse proportion to its coupling reaction rate so that the product is an equimolar 
mixture of amino acids coupled to the starting resin. The bound amino acids are then 
deprotected, and reacted with another balanced amino acid mixture to form an equimolar 
mixture of all possible dipeptides. This process is repeated until a mixture of peptides of the 
desired length (e.g., hexamers) is formed. Note that one need not include all amino acids in 
each step: one can include only one or two amino acids in some steps (e.g., where it is 
known that a particular amino acid is essential in a given position), thus reducing the 
complexity of the mixture. After the synthesis of the peptide library is completed, the 
mixture of peptides is screened for binding to the selected polypeptide. The peptides are 
then tested for their ability to inhibit or enhance activity. Peptides exhibiting the desired 
activity are then isolated and sequenced. 
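
The balancing step of the '175 method can be illustrated numerically. The following is a minimal sketch only, assuming a simple competitive model in which the amount of each amino acid coupled is proportional to its concentration multiplied by its coupling rate; the function names and the relative rates are hypothetical and are not taken from the '175 patent.

```python
def balanced_concentrations(coupling_rates, total_concentration=1.0):
    """Return concentrations inversely proportional to each coupling rate."""
    if any(rate <= 0 for rate in coupling_rates.values()):
        raise ValueError("coupling rates must be positive")
    inverse = {aa: 1.0 / rate for aa, rate in coupling_rates.items()}
    scale = total_concentration / sum(inverse.values())
    return {aa: value * scale for aa, value in inverse.items()}


def coupled_fractions(coupling_rates, concentrations):
    """Fraction of resin sites taken by each amino acid (assumed model:
    amount coupled is proportional to rate x concentration)."""
    weights = {aa: coupling_rates[aa] * concentrations[aa] for aa in coupling_rates}
    total = sum(weights.values())
    return {aa: weight / total for aa, weight in weights.items()}


# Hypothetical relative coupling rates, for illustration only.
relative_rates = {"Gly": 2.0, "Ala": 1.0, "Val": 0.5}
concentrations = balanced_concentrations(relative_rates)
fractions = coupled_fractions(relative_rates, concentrations)
```

Under this assumed model the slowest-coupling amino acid receives the highest concentration, and each amino acid occupies an equal share of the resin.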

The method described in WO 91/17823 is similar. However, instead of reacting the 
synthesis resin with a mixture of activated amino acids, the resin is divided into twenty equal 
portions (or into a number of portions corresponding to the number of different amino acids 
to be added in that step), and each amino acid is coupled individually to its portion of resin. 
The resin portions are then combined, mixed, and again divided into a number of equal 
portions for reaction with the second amino acid. In this manner, each reaction can be easily 
driven to completion. Additionally, one can maintain separate "subpools" by treating 
portions in parallel, rather than combining all resins at each step. This simplifies the process 
of determining which peptides are responsible for any observed receptor binding or signal 
transduction activity. 
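
The divide, couple, and recombine cycle can be sketched as follows. This is an illustrative model only: the three-letter amino acid alphabet and the function name are hypothetical, the resin is represented as a list with one entry per peptide species, and the string order simply records the order of coupling.

```python
from itertools import product


def split_and_pool_cycle(resin_pool, amino_acids):
    """Run one divide, couple, and recombine cycle.

    Each portion of resin receives a single amino acid, so every peptide
    species in the pool is extended once by every amino acid. Returns the
    recombined pool and the separate portions (usable as subpools).
    """
    subpools = {aa: [peptide + aa for peptide in resin_pool] for aa in amino_acids}
    combined = [peptide for aa in amino_acids for peptide in subpools[aa]]
    return combined, subpools


AMINO_ACIDS = "AGV"  # hypothetical reduced alphabet, for illustration only
pool = [""]          # bare resin before the first coupling
for _ in range(3):
    pool, last_subpools = split_and_pool_cycle(pool, AMINO_ACIDS)
```

After three cycles the pool holds every possible tripeptide over the alphabet exactly once, and the portions from the final cycle form subpools that each share a known last-coupled residue.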

In such cases, the subpools containing, e.g., 1-2,000 candidates each are exposed to 
one or more polypeptides of the invention. Each subpool that produces a positive result is 
then resynthesized as a group of smaller subpools (sub-subpools) containing, e.g., 20-100 
candidates, and reassayed. Positive sub-subpools can be resynthesized as individual 
compounds, and assayed finally to determine the peptides that exhibit a high binding 
constant. These peptides can be tested for their ability to inhibit or enhance the native 
activity. The methods described in WO 91/17823 and U.S. Patent No. 5,194,392 (herein 
incorporated by reference) enable the preparation of such pools and subpools by automated 
techniques in parallel, such that all synthesis and resynthesis can be performed in a matter of 
days. 
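
The successive narrowing from subpools to sub-subpools to individual compounds can be sketched as below. This is a hedged illustration, not a laboratory protocol: the candidates are stand-in integer identifiers, the assay is a hypothetical function that reports a pool as positive when it contains any binder, and the pool sizes of 2,000 and 100 are taken from the upper ends of the ranges given above.

```python
def chunk(items, size):
    """Split a list into consecutive pools of at most `size` members."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def deconvolute(candidates, pool_is_positive, pool_sizes=(2000, 100, 1)):
    """Narrow positive pools round by round; return (hits, assays performed)."""
    active_pools = [list(candidates)]
    assays = 0
    for size in pool_sizes:
        positives = []
        for pool in active_pools:
            for subpool in chunk(pool, size):
                assays += 1  # one assay per resynthesized pool
                if pool_is_positive(subpool):
                    positives.append(subpool)
        active_pools = positives
    # The final round uses pools of one, so each surviving pool is one hit.
    return [pool[0] for pool in active_pools], assays


# Hypothetical library of 40,000 candidates with two binders.
binders = {123, 25_000}
library = list(range(40_000))
hits, assays = deconvolute(library, lambda pool: any(c in binders for c in pool))
```

In this example the two binders are recovered with 260 pooled assays rather than 40,000 individual ones, which is the practical benefit of resynthesizing only the positive pools.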

Peptide agonists or antagonists are screened using any available method, such as 
signal transduction, antibody binding, receptor binding, mitogenic assays, chemotaxis 
assays, etc. The methods described herein are presently preferred. The assay conditions 
ideally should resemble the conditions under which the native activity is exhibited in vivo, 
that is, under physiologic pH, temperature, and ionic strength. Suitable agonists or 
antagonists will exhibit strong inhibition or enhancement of the native activity at 
concentrations that do not cause toxic side effects in the subject. Agonists or antagonists that 
compete for binding to the native polypeptide can require concentrations equal to or greater 
than the native concentration, while inhibitors capable of binding irreversibly to the 
polypeptide can be added in concentrations on the order of the native concentration. 

The end results of such screening and experimentation will be at least one novel 
polypeptide binding partner, such as a receptor, encoded by a gene or a cDNA corresponding 
to a polynucleotide of the invention, and at least one peptide agonist or antagonist of the 
novel binding partner. Such agonists and antagonists can be used to modulate, enhance, or 
inhibit receptor function in cells to which the receptor is native, or in cells that possess the 
receptor as a result of genetic engineering. Further, if the novel receptor shares biologically 
important characteristics with a known receptor, information about agonist/antagonist 
binding can facilitate development of improved agonists/antagonists of the known receptor. 
H - Pharmaceutical Compositions and Therapeutic Uses 
Pharmaceutical compositions can comprise polypeptides, antibodies, or 
polynucleotides of the claimed invention. The pharmaceutical compositions will comprise a 
therapeutically effective amount of either polypeptides, antibodies, or polynucleotides of the 
claimed invention. 

The term "therapeutically effective amount" as used herein refers to an amount of a 
therapeutic agent to treat, ameliorate, or prevent a desired disease or condition, or to exhibit a 
detectable therapeutic or preventative effect. The effect can be detected by, for example, 
chemical markers or antigen levels. Therapeutic effects also include reduction in physical 
symptoms, such as decreased body temperature. The precise effective amount for a subject 
will depend upon the subject's size and health, the nature and extent of the condition, and the 
therapeutics or combination of therapeutics selected for administration. Thus, it is not useful 
to specify an exact effective amount in advance. However, the effective amount for a given 
situation is determined by routine experimentation and is within the judgment of the 
clinician. For purposes of the present invention, an effective dose will generally be from 
about 0.01 mg/kg to about 50 mg/kg or about 0.05 mg/kg to about 10 mg/kg of the DNA 
constructs in the individual to which it is administered. 
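
The per-kilogram ranges above translate into a total amount by simple multiplication. The following sketch shows only that unit arithmetic; the names, the range labels "broad" and "narrow", and the 70 kg body weight are assumptions chosen for illustration, and, as stated above, the actual effective amount is determined by routine experimentation and the judgment of the clinician.

```python
# Ranges quoted in the text, in mg of DNA construct per kg of body weight.
DOSE_RANGES_MG_PER_KG = {
    "broad": (0.01, 50.0),
    "narrow": (0.05, 10.0),
}


def dose_window_mg(body_weight_kg, range_name="broad"):
    """Convert a mg/kg range into a total-amount window (mg) for one subject."""
    if body_weight_kg <= 0:
        raise ValueError("body weight must be positive")
    low, high = DOSE_RANGES_MG_PER_KG[range_name]
    return low * body_weight_kg, high * body_weight_kg


# Hypothetical 70 kg subject, for illustration of the arithmetic only.
broad_low, broad_high = dose_window_mg(70.0, "broad")
narrow_low, narrow_high = dose_window_mg(70.0, "narrow")
```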
A pharmaceutical composition can also contain a pharmaceutically acceptable carrier. 
The term "pharmaceutically acceptable carrier" refers to a carrier for administration of a 
therapeutic agent, such as antibodies or a polypeptide, genes, and other therapeutic agents. 
The term refers to any pharmaceutical carrier that does not itself induce the production of 
antibodies harmful to the individual receiving the composition, and which can be 
administered without undue toxicity. Suitable carriers can be large, slowly metabolized 
macromolecules such as proteins, polysaccharides, polylactic acids, polyglycolic acids, 
polymeric amino acids, amino acid copolymers, and inactive virus particles. Such carriers 
are well known to those of ordinary skill in the art. 

Pharmaceutically acceptable salts can be used therein, for example, mineral acid salts 
such as hydrochlorides, hydrobromides, phosphates, sulfates, and the like; and the salts of 
organic acids such as acetates, propionates, malonates, benzoates, and the like. A thorough 
discussion of pharmaceutically acceptable excipients is available in Remington's 
Pharmaceutical Sciences (Mack Pub. Co., N.J. 1991). 

Pharmaceutically acceptable carriers in therapeutic compositions can include liquids 
such as water, saline, glycerol and ethanol. Auxiliary substances, such as wetting or 
emulsifying agents, pH buffering substances, and the like, can also be present in such 
vehicles. Typically, the therapeutic compositions are prepared as injectables, either as liquid 
solutions or suspensions; solid forms suitable for solution in, or suspension in, liquid 
vehicles prior to injection can also be prepared. Liposomes are included within the 
definition of a pharmaceutically acceptable carrier. 

Delivery Methods. Once formulated, the compositions of the invention can be 
(1) administered directly to the subject (e.g., as polynucleotide or polypeptides); (2) 
delivered ex vivo, to cells derived from the subject (e.g., as in ex vivo gene therapy); or (3) 
delivered in vitro for expression of recombinant proteins (e.g., polynucleotides). Direct 
delivery of the compositions will generally be accomplished by injection, either 
subcutaneously, intraperitoneally, intravenously or intramuscularly, or delivered to the 
interstitial space of a tissue. The compositions can also be administered into a tumor or 
lesion. Other modes of administration include oral and pulmonary administration, 
suppositories, and transdermal applications, needles, and gene guns or hyposprays. Dosage 
treatment can be a single dose schedule or a multiple dose schedule. 

Methods for the ex vivo delivery and reimplantation of transformed cells into a 
subject are known in the art and described in, e.g., International Publication No. WO 
93/14778. Examples of cells useful in ex vivo applications include, for example, stem cells, 
particularly hematopoietic, lymph cells, macrophages, dendritic cells, or tumor cells. 
Generally, delivery of nucleic acids for both ex vivo and in vitro applications can be 
accomplished by, for example, dextran-mediated transfection, calcium phosphate 
precipitation, polybrene mediated transfection, protoplast fusion, electroporation, 
encapsulation of the polynucleotide(s) in liposomes, and direct microinjection of the DNA 
into nuclei, all well known in the art. 

Once a gene corresponding to a polynucleotide of the invention has been found to 
correlate with a proliferative disorder, such as neoplasia, dysplasia, and hyperplasia, the 
disorder can be amenable to treatment by administration of a therapeutic agent based on the 
provided polynucleotide or corresponding polypeptide. 

Preparation of antisense polynucleotides is discussed above. Neoplasias that are 
treated with the antisense composition include, but are not limited to, cervical cancers, 
melanomas, colorectal adenocarcinomas, Wilms' tumor, retinoblastoma, sarcomas, 
myosarcomas, lung carcinomas, leukemias, such as chronic myelogenous leukemia, 
promyelocytic leukemia, monocytic leukemia, and myeloid leukemia, and lymphomas, such 
as histiocytic lymphoma. Proliferative disorders that are treated with the therapeutic 
composition include disorders such as anhydric hereditary ectodermal dysplasia, congenital 
alveolar dysplasia, epithelial dysplasia of the cervix, fibrous dysplasia of bone, and 
mammary dysplasia. Hyperplasias, for example, endometrial, adrenal, breast, prostate, or 
thyroid hyperplasias or pseudoepitheliomatous hyperplasia of the skin, are treated with 
antisense therapeutic compositions based upon a polynucleotide of the invention. Even in 
disorders in which mutations in the corresponding gene are not implicated, downregulation 
or inhibition of expression of a gene corresponding to a polynucleotide of the invention can 
have therapeutic application. For example, decreasing gene expression can help to suppress 
tumors in which enhanced expression of the gene is implicated. 

Both the dose of the antisense composition and the means of administration are 
determined based on the specific qualities of the therapeutic composition, the condition, age, 
and weight of the patient, the progression of the disease, and other relevant factors. 
Administration of the therapeutic antisense agents of the invention includes local or systemic 
administration, including injection, oral administration, particle gun or catheterized 
administration, and topical administration. Preferably, the therapeutic antisense composition 
contains an expression construct comprising a promoter and a polynucleotide segment of at 
least 12, 22, 25, 30, or 35 contiguous nucleotides of the antisense strand of a polynucleotide 
disclosed herein. Within the expression construct, the polynucleotide segment is located 
downstream from the promoter, and transcription of the polynucleotide segment initiates at 
the promoter. 
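
Candidate segments of the antisense strand can be enumerated computationally. The sketch below is illustrative only: it treats the antisense strand as the reverse complement of a sense-strand DNA sequence written 5' to 3', and both the function names and the 21-nucleotide example sequence are hypothetical rather than taken from the Sequence Listing.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def antisense_strand(sense):
    """Return the reverse complement of a 5'-to-3' sense DNA sequence."""
    return sense.translate(COMPLEMENT)[::-1]


def antisense_segments(sense, length=12):
    """List every window of `length` contiguous antisense nucleotides."""
    antisense = antisense_strand(sense)
    return [antisense[i:i + length] for i in range(len(antisense) - length + 1)]


# Hypothetical 21-nucleotide sense sequence, for illustration only.
example_sense = "ATGGCCATTGTAATGGGCCGC"
example_antisense = antisense_strand(example_sense)
segments = antisense_segments(example_sense, length=12)
```

Longer minimum lengths (22, 25, 30, or 35 nucleotides) are obtained by changing the `length` argument; a sequence shorter than the requested length yields no segments.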

Various methods are used to administer the therapeutic composition directly to a 
specific site in the body. For example, a small metastatic lesion is located and the 
therapeutic composition injected several times in several different locations within the body 
of the tumor. Alternatively, arteries which serve a tumor are identified, and the therapeutic 
composition injected into such an artery, in order to deliver the composition directly into the 
tumor. A tumor that has a necrotic center is aspirated and the composition injected directly 
into the now empty center of the tumor. The antisense composition is directly administered 
to the surface of the tumor, for example, by topical application of the composition. X-ray 
imaging is used to assist in certain of the above delivery methods. 

Receptor-mediated targeted delivery of therapeutic compositions containing an 
antisense polynucleotide, subgenomic polynucleotides, or antibodies to specific tissues is 
also used. Receptor-mediated DNA delivery techniques are described in, for example, 
Findeis et al., Trends Biotechnol. (1993) 11:202; Chiou et al., Gene Therapeutics: Methods 
And Applications Of Direct Gene Transfer (J.A. Wolff, ed.) (1994); Wu et al., J. Biol. Chem. 
(1988) 263:621; Wu et al., J. Biol. Chem. (1994) 269:542; Zenke et al., Proc. Natl. Acad. 
Sci. (USA) (1990) 87:3655; Wu et al., J. Biol. Chem. (1991) 266:338. Preferably, receptor- 
mediated targeted delivery of therapeutic compositions containing antibodies of the 
invention is used to deliver the antibodies to specific tissue. 

Therapeutic compositions containing antisense subgenomic polynucleotides are 
administered in a range of about 100 ng to about 200 mg of DNA for local administration in 
a gene therapy protocol. Concentration ranges of about 500 ng to about 50 mg, about 1 µg to 
about 2 mg, about 5 µg to about 500 µg, and about 20 µg to about 100 µg of DNA can also 
be used during a gene therapy protocol. Factors such as method of action and efficacy of 
transformation and expression are considerations which will affect the dosage required for 
ultimate efficacy of the antisense subgenomic polynucleotides. Where greater expression is 
desired over a larger area of tissue, larger amounts of antisense subgenomic polynucleotides 
or the same amounts readministered in a successive protocol of administrations, or several 
administrations to different adjacent or close tissue portions of, for example, a tumor site, 
may be required to effect a positive therapeutic outcome. In all cases, routine 
experimentation in clinical trials will determine specific ranges for optimal therapeutic 
effect. A more complete description of gene therapy vectors, especially retroviral vectors, is 
contained in U.S. Serial No. 08/869,309, which is expressly incorporated herein, and in 
section G below. 

For polynucleotide-related genes encoding polypeptides or proteins with anti- 
inflammatory activity, suitable use, doses, and administration are described in U.S. Patent 
No. 5,654,173. Therapeutic agents also include antibodies to proteins and polypeptides 
encoded by the polynucleotides of the invention and related genes, as described in U.S. 
Patent No. 5,654,173. 

I - Gene Therapy 

The therapeutic polynucleotides and polypeptides of the present invention can be 
utilized in gene delivery vehicles. The gene delivery vehicle can be of viral or non-viral 
origin (see generally, Jolly, Cancer Gene Therapy (1994) 1:51; Kimura, Human Gene 
Therapy (1994) 5:845; Connelly, Human Gene Therapy (1995) 7:185; and Kaplitt, Nature 
Genetics (1994) 6:148). Gene therapy vehicles for delivery of constructs including a coding 
sequence of a therapeutic of the invention can be administered either locally or systemically. 
These constructs can utilize viral or non-viral vector approaches. Expression of such coding 
sequences can be induced using endogenous mammalian or heterologous promoters. 
Expression of the coding sequence can be either constitutive or regulated. 

The present invention can employ recombinant retroviruses which are constructed to 
carry or express a selected nucleic acid molecule of interest. Retrovirus vectors that can be 
employed include those described in EP 0 415 731; WO 90/07936; WO 94/03622; WO 
93/25698; WO 93/25234; U.S. Patent No. 5,219,740; WO 93/11230; WO 93/10218; Vile 
and Hart, Cancer Res. (1993) 53:3860; Vile et al., Cancer Res. (1993) 53:962; Ram et al., 
Cancer Res. (1993) 53:83; Takamiya et al., J. Neurosci. Res. (1992) 33:493; Baba et al., J. 
Neurosurg. (1993) 79:729; U.S. Patent No. 4,777,127; GB Patent No. 2,200,651; and EP 0 
345 242. Preferred recombinant retroviruses include those described in WO 91/02805. 

Packaging cell lines suitable for use with the above-described retroviral vector 
constructs can be readily prepared (see, e.g., WO 95/30763 and WO 92/05266), and used to 
create producer cell lines (also termed vector cell lines) for the production of recombinant 
vector particles. Within particularly preferred embodiments of the invention, packaging cell 
lines are made from human (such as HT1080 cells) or mink parent cell lines, thereby 
allowing production of recombinant retroviruses that can survive inactivation in human 
serum. 

The present invention also employs alphavirus-based vectors that can function as 
gene delivery vehicles. Such vectors can be constructed from a wide variety of alphaviruses, 
including, for example, Sindbis virus vectors, Semliki forest virus (ATCC VR-67; ATCC 
VR-1247), Ross River virus (ATCC VR-373; ATCC VR-1246) and Venezuelan equine 
encephalitis virus (ATCC VR-923; ATCC VR-1250; ATCC VR-1249; ATCC VR-532). 
Representative examples of such vector systems include those described in U.S. Patent Nos. 
5,091,309; 5,217,879; and 5,185,440; WO 92/10578; WO 94/21792; WO 95/27069; WO 
95/27044; and WO 95/07994. Gene delivery vehicles of the present invention can also 
employ parvovirus such as adeno-associated virus (AAV) vectors. Representative examples 
include the AAV vectors disclosed by Srivastava in WO 93/09239, Samulski et al., J. Virol. 
(1989) 63:3822; Mendelson et al., Virol. (1988) 166:154; and Flotte et al., PNAS (1993) 
90:10613. 

Representative examples of adenoviral vectors include those described by Berkner, 
Biotechniques (1988) 6:616; Rosenfeld et al., Science (1991) 252:431; WO 93/19191; Kolls 
et al., PNAS (1994) 91:215; Kass-Eisler et al., PNAS (1993) 90:11498; Guzman et al., 
Circulation (1993) 88:2838; Guzman et al., Cir. Res. (1993) 73:1202; Zabner et al., Cell 
(1993) 75:207; Li et al., Hum. Gene Ther. (1993) 4:403; Cailaud et al., Eur. J. Neurosci. 
(1993) 5:1287; Vincent et al., Nat. Genet. (1993) 5:130; Jaffe et al., Nat. Genet. (1992) 
1:372; and Levrero et al., Gene (1991) 101:195. Exemplary adenoviral gene therapy vectors 
employable in this invention also include those described in WO 94/12649; WO 93/03769; 
WO 93/19191; WO 94/28938; WO 95/11984 and WO 95/00655. Administration of DNA 
linked to killed adenovirus as described in Curiel, Hum. Gene Ther. (1992) 3:147 can be 
employed. 

Other gene delivery vehicles and methods can be employed, including polycationic 
condensed DNA linked or unlinked to killed adenovirus alone, for example Curiel, Hum. 
Gene Ther. (1992) 3:147; ligand linked DNA, for example see Wu, J. Biol. Chem. (1989) 
264:16985; eukaryotic cell delivery vehicles cells, for example see U.S. Pat. No. 5,814,482; 
WO 95/07994; WO 96/17072; WO 95/30763; and WO 97/42338; deposition of 
photopolymerized hydrogel materials; hand-held gene transfer particle gun, as described in 
U.S. Patent No. 5,149,655; ionizing radiation as described in U.S. Patent No. 5,206,152 and 
in WO 92/11033; nucleic charge neutralization or fusion with cell membranes. Additional 
approaches are described in Philip, Mol. Cell Biol. (1994) 14:2411, and in Woffendin, Proc. 
Natl. Acad. Sci. (1994) 91:1581. 

Naked DNA can also be employed. Exemplary naked DNA introduction methods are 
described in WO 90/11092 and U.S. Patent No. 5,580,859. Uptake efficiency can be 
improved using biodegradable latex beads. DNA coated latex beads are efficiently 
transported into cells after endocytosis initiation by the beads. The method can be improved 
further by treatment of the beads to increase hydrophobicity and thereby facilitate disruption 
of the endosome and release of the DNA into the cytoplasm. Liposomes that can act as gene 
delivery vehicles are described in U.S. Patent No. 5,422,120; WO 95/13796; WO 94/23697; 
WO 91/14445; and EP 0524968. 

Further non-viral delivery suitable for use includes mechanical delivery systems such 
as the approach described in Woffendin et al., Proc. Natl. Acad. Sci. USA (1994) 
91(24):11581. Moreover, the coding sequence and the product of expression of such can be 
delivered through deposition of photopolymerized hydrogel materials. Other conventional 
methods for gene delivery that can be used for delivery of the coding sequence include, for 
example, use of hand-held gene transfer particle gun, as described in U.S. Patent No. 
5,149,655; use of ionizing radiation for activating transferred gene, as described in U.S. 
Patent No. 5,206,152 and WO 92/11033. 

The present invention will now be illustrated by reference to the following examples 
which set forth particularly advantageous embodiments. However, it should be noted that 
these embodiments are illustrative and are not to be construed as restricting the invention in 
any way. 

EXAMPLES 

The present invention is now illustrated by reference to the following examples 
which set forth particularly advantageous embodiments. However, these embodiments are 
illustrative and are not meant to be construed as restricting the invention in any way. 

Example 1: Source of Biological Materials and Overview of Novel Polynucleotides 
Expressed by the Biological Materials 
Human colon cancer cell line KM12L4-A (Morikawa, K., et al., Cancer Research 
(1988) 48:6863) was used to construct a cDNA library from mRNA isolated from the cells. 
As described in the above overview, a total of 4,693 sequences expressed by the KM12L4-A 
cell line were isolated and analyzed; most sequences were about 275-300 nucleotides in 
length. The KM12L4-A cell line is derived from the KM12C cell line. The KM12C cell 
line, which is poorly metastatic (low metastatic), was established in culture from a Dukes' 
stage B2 surgical specimen (Morikawa et al., Cancer Res. (1988) 48:6863). The KM12L4-A 
is a highly metastatic subline derived from KM12C (Yeatman et al., Nucl. Acids Res. (1995) 
23:4007; Bao-Ling et al., Proc. Annu. Meet. Am. Assoc. Cancer. Res. (1995) 27:3269). The 
KM12C and KM12C-derived cell lines (e.g., KM12L4, KM12L4-A, etc.) are well- 
recognized in the art as a model cell line for the study of colon cancer (see, e.g., Morikawa 
et al., supra; Radinsky et al., Clin. Cancer Res. (1995) 1:19; Yeatman et al. (1995) supra; 
Yeatman et al., Clin. Exp. Metastasis (1996) 14:246). 

The sequences were first masked to eliminate low complexity sequences using the XBLAST 
masking program (Claverie, "Effective Large-Scale Sequence Similarity Searches," In: 
Computer Methods for Macromolecular Sequence Analysis, Doolittle, ed., Meth. Enzymol. 
266:212-227 Academic Press, NY, NY (1996); see particularly Claverie, in "Automated 
DNA Sequencing and Analysis Techniques," Adams et al., eds., Chap. 36, p. 267 Academic 
Press, San Diego, 1994 and Claverie et al., Comput. Chem. (1993) 17:191). Generally, 
masking does not influence the final search results, except to eliminate sequences of 
relatively little interest due to their low complexity, and to eliminate multiple "hits" based 
on similarity to repetitive regions common to multiple sequences, e.g., Alu repeats. Masking 
resulted in the elimination of 43 sequences. The remaining sequences were then used in a 
BLASTN vs. Genbank search with search parameters of greater than 70% overlap, 99% 
identity, and a p value of less than 1 x 10⁻⁴⁰, which search resulted in the discarding of 
1,432 sequences. Sequences from this search also were discarded if the inclusive parameters 
were met, but the sequence was ribosomal or vector-derived. 

The resulting sequences from the previous search were classified into three groups 
(1, 2 and 3 below) and searched in a BLASTX vs. NRP (non-redundant proteins) database 
search: (1) unknown (no hits in the Genbank search), (2) weak similarity (greater than 45% 
identity and p value of less than 1 x 10⁻⁵), and (3) high similarity (greater than 60% overlap, 
greater than 80% identity, and p value less than 1 x 10⁻⁵). This search resulted in discard of 
98 sequences as having greater than 70% overlap, greater than 99% identity, and p value of 
less than 1 x 10⁻⁴⁰. 

The remaining sequences were classified as unknown (no hits), weak similarity, and 
high similarity (parameters as above). Two searches were performed on these sequences. 
First, a BLAST vs. EST database search resulted in discard of 1771 sequences (sequences 
with greater than 99% overlap, greater than 99% similarity and a p value of less than 
1 x 10⁻⁴⁰; sequences with a p value of less than 1 x 10⁻⁶⁵ when compared to a database 
sequence of human origin were also excluded). Second, a BLASTN vs. Patent GeneSeq 
database resulted in discard of 15 sequences (greater than 99% identity; p value less than 
1 x 10⁻⁴⁰; greater than 99% overlap). 

The remaining sequences were subjected to screening using other rules and 
redundancies in the dataset. Sequences with a p value of less than 1 x 10⁻¹¹¹ in relation to a 
database sequence of human origin were specifically excluded. The final result provided the 
404 sequences listed in the accompanying Sequence Listing. The Sequence Listing is 
arranged beginning with sequences with no similarity to any sequence in a database 
searched, and ending with sequences with the greatest similarity. Each identified 
polynucleotide represents sequence from at least a partial mRNA transcript. Polynucleotides 
that were determined to be novel were assigned a sequence identification number. 
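
The discard counts reported in this Example can be tallied as a simple check. The sketch below uses only the numbers stated above; the variable names are hypothetical, and the count attributed to the final screening step is derived here by subtraction rather than stated in the text.

```python
INITIAL_SEQUENCES = 4693
FINAL_SEQUENCES = 404

# (filter step, number of sequences discarded), in the order described.
DISCARDS = [
    ("XBLAST low-complexity masking", 43),
    ("BLASTN vs. Genbank", 1432),
    ("BLASTX vs. NRP", 98),
    ("BLAST vs. EST", 1771),
    ("BLASTN vs. Patent GeneSeq", 15),
]

remaining = INITIAL_SEQUENCES
remaining_after = {}
for step, discarded in DISCARDS:
    remaining -= discarded
    remaining_after[step] = remaining

# Derived value (not stated in the text): sequences removed by the final
# screening for other rules and redundancies in the dataset.
removed_by_final_screening = remaining - FINAL_SEQUENCES
```

On these figures, 1,334 sequences remain after the five named searches, so the final screening accounts for the removal of a further 930 sequences before the 404 listed sequences are reached.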

The novel polynucleotides were assigned sequence identification numbers SEQ 
ID NOS:1-404. The DNA sequences corresponding to the novel polynucleotides are 
provided in the Sequence Listing. The majority of the sequences are presented in the 
Sequence Listing in the 5' to 3' direction. A small number, 25, are listed in the Sequence 
Listing in the 5' to 3' direction but the sequence as written is actually 3' to 5'. These 
sequences are readily identified with the designation "AR" in the Sequence Name in Table 1 
(inserted before the claims). The sequences correctly listed in the 5' to 3' direction in the 
Sequence Listing are designated "AF." The Sequence Listing filed herewith therefore 
contains 25 sequences listed in the reverse order, namely SEQ ID NOS:47, 97, 137, 171, 
173, 179, 182, 194, 200, 202, 213, 227, 258, 264, 275, 302, 313, 324, 329, 330, 331, 338, 
358, 379, and 404. 

Because the provided polynucleotides represent partial mRNA transcripts, two or 
more polynucleotides of the invention may represent different regions of the same mRNA 
transcript and the same gene. Thus, if two or more SEQ ID NOS: are identified as belonging 
to the same clone, then either sequence can be used to obtain the full-length mRNA or gene. 

In order to confirm the sequences of SEQ ID NOS:1-404, inserts of the clones 
corresponding to these polynucleotides were re-sequenced. These "validation" sequences 
are provided in SEQ ID NOS:405-800. These validation sequences were often longer than 
the original polynucleotide sequences they validate, and thus often provide additional 
sequence information. Validation sequences can be correlated with the original sequences 
they validate by identifying those sequences of SEQ ID NOS:1-404 and the validation 
sequences of SEQ ID NOS:405-800 that share the same clone name in Table 1. 
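
The correlation by clone name can be sketched as a simple grouping. This is an illustration under stated assumptions: Table 1 is represented as (SEQ ID NO, clone name) pairs, and the clone names and rows shown are hypothetical placeholders, not entries from Table 1.

```python
from collections import defaultdict

ORIGINAL_IDS = range(1, 405)      # SEQ ID NOS:1-404
VALIDATION_IDS = range(405, 801)  # SEQ ID NOS:405-800


def pair_by_clone(table1_rows):
    """Group original and validation SEQ ID NOS that share a clone name."""
    grouped = defaultdict(lambda: {"original": [], "validation": []})
    for seq_id, clone_name in table1_rows:
        if seq_id in ORIGINAL_IDS:
            grouped[clone_name]["original"].append(seq_id)
        elif seq_id in VALIDATION_IDS:
            grouped[clone_name]["validation"].append(seq_id)
        else:
            raise ValueError(f"SEQ ID NO:{seq_id} is outside 1-800")
    return dict(grouped)


# Hypothetical rows standing in for Table 1.
rows = [(1, "clone_A"), (2, "clone_B"), (3, "clone_B"),
        (405, "clone_A"), (406, "clone_B")]
pairs = pair_by_clone(rows)
```

A clone with more than one original entry, such as the hypothetical clone_B here, corresponds to the case described above in which two or more SEQ ID NOS: belong to the same clone.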

Example 2: Results of Public Database Search to Identify Function of Gene Products 

SEQ ID NOS:1-404, as well as the validation sequences SEQ ID NOS:405-800, were 
translated in all three reading frames to determine the best alignment with the individual 
sequences. These amino acid sequences and nucleotide sequences are referred to, generally, 
as query sequences, which are aligned with the individual sequences. Query and individual 
sequences were aligned using the BLAST programs, available over the world wide web at 
http://www.ncbi.nlm.nih.gov/BLAST/. Again the sequences were masked to various extents 
to prevent searching of repetitive sequences or poly-A sequences, using the XBLAST 
program for masking low complexity as described above in Example 1. 
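
Translation in three reading frames can be sketched as follows. This is a minimal illustration assuming the standard genetic code and an unambiguous DNA query written 5' to 3'; the function name and the 21-nucleotide example are hypothetical, and the BLAST programs perform their own translation internally.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate_three_frames(dna):
    """Translate a DNA sequence in forward frames 1, 2 and 3.

    Stop codons are written as '*', codons containing unrecognized
    characters as 'X', and trailing partial codons are ignored.
    """
    dna = dna.upper()
    frames = []
    for offset in range(3):
        codons = (dna[i:i + 3] for i in range(offset, len(dna) - 2, 3))
        frames.append("".join(CODON_TABLE.get(codon, "X") for codon in codons))
    return frames


# Hypothetical query sequence, for illustration only.
frames = translate_three_frames("ATGGCCATTGTAATGGGCCGC")
```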

Table 2 (inserted before the claims) shows the results of the alignments. Table 2 
refers to each sequence by its SEQ ID NO:, the accession numbers and descriptions of 
nearest neighbors from the Genbank and Non-Redundant Protein searches, and the p values 
of the search results. Table 1 identifies each SEQ ID NO: by SEQ name, clone ID, and 
cluster. As discussed above, a single cluster includes polynucleotides representing the same 
gene or gene family, and generally represents sequences encoding the same gene product. 

For each of SEQ ID NOS:1-800, the best alignment to a protein or DNA sequence is 
included in Table 2. The activity of the polypeptide encoded by SEQ ID NOS:1-800 is the 
same as or similar to that of the nearest neighbor reported in Table 2. The accession number 
of the nearest neighbor is reported, providing a reference to the activities exhibited by the 
nearest neighbor. The search program and database used for the alignment also are indicated 
as well as a calculation of the p value. 

Full length sequences or fragments of the polynucleotide sequences of the nearest 
neighbors can be used as probes and primers to identify and isolate the full length sequence 
of SEQ ID NOS:1-800. The nearest neighbors can indicate a tissue or cell type to be used to 
construct a library for the full-length sequences of SEQ ID NOS:1-800. 

SEQ ID NOS:1-800 and the translations thereof may be human homologs of known 
genes of other species or novel allelic variants of known human genes. In such cases, these 
new human sequences are suitable as diagnostics or therapeutics. As diagnostics, the human 
sequences SEQ ID NOS:1-800 exhibit greater specificity in detecting and differentiating 
human cell lines and types than homologs of other species. The human polypeptides 
encoded by SEQ ID NOS:1-800 are likely to be less immunogenic when administered to 
humans than homologs from other species. Further, on administration to humans, the 
polypeptides encoded by SEQ ID NOS:1-800 can show greater specificity or can be better 
regulated by other human proteins than are homologs from other species. 

Example 3: Members of Protein Families 

After conducting a profile search as described in the specification above, several of the 
polynucleotides of the invention were found to encode polypeptides having characteristics of 
a polypeptide belonging to a known protein family (and thus represent new members of 
these protein families) and/or comprising a known functional domain (Table 3). Thus the 
invention encompasses fragments, fusions, and variants of such polynucleotides that retain 
biological activity associated with the protein family and/or functional domain identified 
herein. 



Table 3. Polynucleotides encoding gene products of a protein family or having a known 
functional domain(s). 

SEQ ID NO:  Biological Activity (Profile hit)                    Start        Stop         Dir
24          4 transmembrane segments integral membrane proteins  1218         578          rev
41          4 transmembrane segments integral membrane proteins  1086         413          rev
101         4 transmembrane segments integral membrane proteins  1206         544          rev
157         4 transmembrane segments integral membrane proteins  721          33           rev
341         4 transmembrane segments integral membrane proteins  1253         613          rev
395         4 transmembrane segments integral membrane proteins  530          10           for
395         4 transmembrane segments integral membrane proteins  696          17           for
395         4 transmembrane segments integral membrane proteins  471          39           rev
24          7 transmembrane receptor (Secretin family)           1301         491          rev
41          7 transmembrane receptor (Secretin family)           1309         10           rev
101         7 transmembrane receptor (Secretin family)           1330         296          rev
157         7 transmembrane receptor (Secretin family)           1173         249          rev
291         7 transmembrane receptor (Secretin family)           1400         269          rev
291         7 transmembrane receptor (Secretin family)           712          130          for
305         7 transmembrane receptor (Secretin family)           926          4            for
305         7 transmembrane receptor (Secretin family)           753          55           rev
315         7 transmembrane receptor (Secretin family)           1058         270          rev
341         7 transmembrane receptor (Secretin family)           1265         534          rev
116         Ank repeat                                           141          218          for
251         Ank repeat                                           290          [illegible]  for
251         Ank repeat                                           467          387          for
63          ATPases Associated with Various Cellular Activities  [illegible]  [illegible]  for
116         ATPases Associated with Various Cellular Activities  [illegible]  [illegible]  for
134         ATPases Associated with Various Cellular Activities  525          57           rev
136         ATPases Associated with Various Cellular Activities  712          163          for
151         ATPases Associated with Various Cellular Activities  719          73           for
151         ATPases Associated with Various Cellular Activities  386          13           for
384         ATPases Associated with Various Cellular Activities  664          140          for
404         ATPases Associated with Various Cellular Activities  704          52           for

374 (Basic region plus leucine zipper transcription factors 


298 


146 


fnr 


97 


Bromodomain (conserved sequence found in human, 
Drosophila and yeast proteins.) 


230 


63 


for 


136 


EF-hand 


121 


207 


for 


242 


EF-hand 


238 


155 


for 


379 


EF-hand 


212 


126 


for 


308 (Eukaryotic aspartyl proteases 


1300 


461 jrev 


213 |GATA family of transcription factors 


720 


377 Ifor 


367 |G-protetn alpha subunit 


971 


467 |rev 


188 


Phorbol esters/diacylglycerol binding 


91 


177 |for 


251 


Phorbol esters/diacylglycerol binding 


133 


219 for 


202 


protein kinase 


482 


1 rev 


202 


protein kinase 


970 


1 rev 


315 


protein kinase 


739 


158 for 


315 


protein kinase 


1023 


197 for 


367 


protein kinase 


1046 


285 rev 


397 


protein kinase 


511 


6 for : 


256 


Protein phosphatase 2C 


13 


90 for r 


256 


Protein phosphatase 2C 


163 


86 |for 


382 


Protein Tyrosine Phosphatase 


261 


2 |for 


306 


SH3 Domain 


141 


296 |for 


386 


SH3 Domain 


359 


209 |for 


169 


Trypsin 


764 


164 |rev 


188 


WD domain, G-beta repeats 


480 


382 |for 


188 


WD domain, G-beta repeats 


206 


117 for 


335 


WD domain, G-beta repeats 


3 


92 . |for 


23 


wnt family of developmental signaling proteins 


1151 


335 Irev 


291 


wnt family of developmental signaling proteins 


779 


89 rev 


291 


wnt family of developmental signaling proteins 


1347 


382 rev 


324 


wnt family of developmental signaling proteins 


1180 


499 rev 


330 


wnt family of developmental signaling proteins 


1180 


499 rev 


341 


wnt family of developmental signaling proteins 


1399 


560 |rev 
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Table 3 Polynucleotides encoding gene products of a protein family or having a known 
functional domain(s). 



CPA WJTX 

oEQ ID 


Biological Activity (Profile hit) 


Start IStop 


Dir 


i« 


— m 

wnt family of developmental signaling proteins 


880 |49 


rev 


loo 


WW/rsp5/WwP domain containing proteins 


431 |354 


for j 


379 


WW/rsp5AVWP domain containing proteins 


12 |89 


for j 


395 


WW/rsp5/WWP domain containing proteins 


153 |76 


for i 


395 


WW/rsp5/WWP domain containing proteins 


156 |64 - 


rv j 

tor j 


61 


Zinc finger, C2H2 type 


254 


192 


for i 


306 


Zinc finger, C2H2 type 


428 


367 


for ! 


386 


Zinc finger, C2H2 type 


191 


253 .. 


for j 


322 |Zinc finger, CCHC class 


553 


503 


for ! 


306 


Zinc-binding metal ioprotease domain 


101 


60 


rev | 


395 


Zinc-binding metalioprotease domain 


28 


69 


rev 1 



Start and stop indicate the position within the individual sequences that align with the
query sequence having the indicated SEQ ID NO. The direction (Dir) indicates the
orientation of the query sequence with respect to the individual sequence, where forward
(for) indicates that the alignment is in the same direction (left to right) as the sequence
provided in the Sequence Listing and reverse (rev) indicates that the alignment is with a
sequence complementary to the sequence provided in the Sequence Listing.
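
The Start, Stop, and Dir columns described above can be read as instructions for pulling the aligned region out of a listed sequence. The following is a minimal sketch only, under stated assumptions: the coordinates are taken to be 1-based and inclusive, they may be given in either order, and a "rev" hit is returned as the reverse complement of the listed strand. The function names and the toy sequences are hypothetical and are not taken from the Sequence Listing.

```python
# Complement table for unambiguous DNA bases (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def aligned_region(seq, start, stop, direction):
    """Extract the region of `seq` covered by a profile hit.

    Assumes 1-based inclusive coordinates given in either order.
    `direction` is "for" (same strand as listed) or "rev"
    (complementary strand), following the table footnote.
    """
    low, high = sorted((start, stop))
    region = seq[low - 1:high]
    if direction == "rev":
        return reverse_complement(region)
    if direction != "for":
        raise ValueError("direction must be 'for' or 'rev'")
    return region
```

For example, `aligned_region("AACCGGTT", 3, 6, "for")` yields the listed-strand bases at positions 3 to 6, while a "rev" hit over the same coordinates yields their reverse complement.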

Some polynucleotides exhibited multiple profile hits because, for example, the particular
sequence contains overlapping profile regions, and/or the sequence contains two different
functional domains. These profile hits are described in more detail below.
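
As an illustration of how sequences with more than one distinct profile hit might be picked out of Table 3, the sketch below groups hits by SEQ ID NO. The `HITS` list holds only a few rows transcribed from the table for demonstration, and the function names are hypothetical.

```python
from collections import defaultdict

# (SEQ ID NO, profile hit) pairs transcribed from a few rows of Table 3.
HITS = [
    (395, "4 transmembrane segments integral membrane proteins"),
    (395, "4 transmembrane segments integral membrane proteins"),
    (395, "WW/rsp5/WWP domain containing proteins"),
    (395, "Zinc-binding metalloprotease domain"),
    (136, "ATPases Associated with Various Cellular Activities"),
    (136, "EF-hand"),
    (213, "GATA family of transcription factors"),
]


def profiles_by_sequence(hits):
    """Map each SEQ ID NO to the sorted list of distinct profiles it hit."""
    grouped = defaultdict(set)
    for seq_id, profile in hits:
        grouped[seq_id].add(profile)
    return {seq_id: sorted(profiles) for seq_id, profiles in grouped.items()}


def multi_domain_sequences(hits):
    """Keep only sequences that hit more than one distinct profile."""
    return {
        seq_id: profiles
        for seq_id, profiles in profiles_by_sequence(hits).items()
        if len(profiles) > 1
    }
```

Repeated hits of one sequence against the same profile (overlapping profile regions) collapse to a single entry, so only sequences with two or more different domains are reported.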

a) Four Transmembrane Integral Membrane Proteins. SEQ ID NOS: 24, 41, 101,
157, 341, and 395 correspond to a sequence encoding a polypeptide that is a member of the 4
transmembrane segments integral membrane protein family (transmembrane 4 family). The
transmembrane 4 family of proteins includes a number of evolutionarily-related eukaryotic
cell surface antigens (Levy et al., J. Biol. Chem. (1991) 266:14597; Tomlinson et al., Eur. J.
Immunol. (1993) 25:136; Barclay et al., The leucocyte antigen factsbook (1993) Academic
Press, London/San Diego). The proteins belonging to this family include: 1) Mammalian
antigen CD9 (MIC3), which is involved in platelet activation and aggregation; 2)
Mammalian leukocyte antigen CD37, expressed on B lymphocytes; 3) Mammalian
leukocyte antigen CD53 (OX-44), which is implicated in growth regulation in hematopoietic
cells; 4) Mammalian lysosomal membrane protein CD63 (melanoma-associated antigen
ME491; antigen AD1); 5) Mammalian antigen CD81 (cell surface protein TAPA-1), which
is implicated in regulation of lymphoma cell growth; 6) Mammalian antigen CD82 (protein
R2; antigen C33; Kangai 1 (KAI1)), which associates with CD4 or CD8 and delivers
costimulatory signals for the TCR/CD3 pathway; 7) Mammalian antigen CD151 (SFA-1;
platelet-endothelial tetraspan antigen 3 (PETA-3)); 8) Mammalian cell surface glycoprotein
A15 (TALLA-1; MXS1); 9) Mammalian novel antigen 2 (NAG-2); 10) Human
tumor-associated antigen CO-029; 11) Schistosoma mansoni and japonicum 23 Kd surface
antigen (SM23 / SJ23).

The members of the 4 transmembrane family share several characteristics. First, they
all are apparently type III membrane proteins, which are integral membrane proteins
containing an N-terminal membrane-anchoring domain which is not cleaved during
biosynthesis and which functions both as a translocation signal and as a membrane anchor.
The family members also contain three additional transmembrane regions, at least seven
conserved cysteine residues, and are of approximately the same size (218 to 284 residues).
These proteins are collectively known as the "transmembrane 4 superfamily" (TM4) because
they span the plasma membrane four times. A schematic diagram of the domain structure of
these proteins is as follows:

    +-+-----+-------+-----+-----+-----+---------------+-----+-----+
    | | TMa | Extra | TM2 | Cyt | TM3 | Extracellular | TM4 | Cyt |
    +-+-----+-------+-----+-----+-----+---------------+-----+-----+
                          *********

where Cyt is the cytoplasmic domain; TMa is the transmembrane anchor; TM2 to TM4
represent transmembrane regions 2 to 4; 'C' are conserved cysteines; and '*' indicates the
position of the consensus pattern. The consensus pattern spans a conserved region including
two cysteines located in a short cytoplasmic loop between two transmembrane domains:
Consensus pattern: G-x(3)-[LIVMF]-x(2)-[GSA]- ... -x(2)-[EG]-x(2)-[CWN]-[LIVM](2).

b) Seven Transmembrane Integral Membrane Proteins. SEQ ID NOS: 24, 41, 101,
157, 291, 305, 315, and 341 correspond to a sequence encoding a polypeptide that is a
member of the seven transmembrane receptor family. G-protein coupled receptors
(Strosberg, Eur. J. Biochem. (1991) 196:1; Kerlavage, Curr. Opin. Struct. Biol. (1991)
7:394; Probst et al., DNA Cell Biol. (1992) 11:1; and Savarese et al., Biochem. J. (1992)
293:1) (also called R7G) are an extensive group of hormone, neurotransmitter, odorant,
and light receptors which transduce extracellular signals by interaction with guanine
nucleotide-binding (G) proteins. The tertiary structure of these receptors is thought to be
highly similar. They have seven hydrophobic regions, each of which most probably spans
the membrane. The N-terminus is located on the extracellular side of the membrane and is
often glycosylated, while the C-terminus is cytoplasmic and generally phosphorylated.
Three extracellular loops alternate with three intracellular loops to link the seven
transmembrane regions. Most, but not all, of these receptors lack a signal peptide. The most
conserved parts of these proteins are the transmembrane regions and the first two
cytoplasmic loops. A conserved acidic-Arg-aromatic triplet is present in the N-terminal
extremity of the second cytoplasmic loop (Attwood et al., Gene (1991) 98:153) and could
be implicated in the interaction with G proteins.

To detect this widespread family of proteins a pattern is used that contains the
conserved triplet and that also spans the major part of the third transmembrane helix.
Additional information about the seven transmembrane receptor family, and methods for
their identification and use, is found in U.S. Patent No. 5,759,804. Due in part to their
expression on the cell surface and other attractive characteristics, seven transmembrane
protein family members are of particular interest as drug targets, as surface antigen markers,
and as drug delivery targets (e.g., using antibody-drug complexes and/or use of anti-seven
transmembrane protein antibodies as therapeutics in their own right).

c) Ank Repeats. SEQ ID NOS: 116 and 251 represent polynucleotides encoding Ank
repeat-containing proteins. The ankyrin motif is a 33 amino acid sequence named after the
protein ankyrin which has 24 tandem 33-amino-acid motifs. Ank repeats were originally
identified in the cell-cycle-control protein cdc10 (Breeden et al., Nature (1987) 329:651).
Proteins containing ankyrin repeats include ankyrin, myotrophin, I-kappaB proteins, cell cycle
protein cdc10, the Notch receptor (Matsuno et al., Development (1997) 124(21):4265); G9a
(or BAT8) of the class III region of the major histocompatibility complex (Biochem. J.
290:811-818, 1993), FABP, GABP, 53BP2, Lin12, glp-1, SWI4, and SWI6. The functions
of the ankyrin repeats are compatible with a role in protein-protein interactions (Bork,
Proteins (1993) 17(4):363; Lambert and Bennet, Eur. J. Biochem. (1993) 211:1; Kerr et al.,
Current Op. Cell Biol. (1992) 4:496; Bennet et al., J. Biol. Chem. (1980) 255:6424).

The 90 kD N-terminal domain of ankyrin contains a series of 24 33-amino-acid ank
repeats (Lux et al., Nature (1990) 344:36-41; Lambert et al., PNAS USA (1990) 87:1730).
The 24 ank repeats form four folded subdomains of 6 repeats each. These four repeat
subdomains mediate interactions with at least 7 different families of membrane proteins.
Ankyrin contains two separate binding sites for anion exchanger dimers. One site utilizes
repeat subdomain two (repeats 7-12) and the other requires both repeat subdomains 3 and 4
(repeats 13-24). Since the anion exchangers exist in dimers, ankyrin binds 4 anion
exchangers at the same time (Michaely and Bennett, J. Biol. Chem. (1995) 270(37):22050).
The repeat motifs are involved in ankyrin interaction with tubulin, spectrin, and other
membrane proteins (Lux et al., Nature (1990) 344:36).

The Rel/NF-kappaB/Dorsal family of transcription factors have activity that is
controlled by sequestration in the cytoplasm in association with inhibitory proteins referred
to as I-kappaB (Gilmore, Cell (1990) 52:841; Nolan and Baltimore, Curr. Opin. Genet. Dev.
(1992) 2:211; Baeuerle, Biochim. Biophys. Acta (1991) 1072:63; Schmitz et al., Trends Cell
Biol. (1991) 7:130). I-kappaB proteins contain 5 to 8 copies of 33 amino acid ankyrin
repeats, and certain NF-kappaB/rel proteins are also regulated by cis-acting ankyrin repeat
containing domains, including p105NF-kappaB, which contains a series of ankyrin repeats
(Diehl and Hannink, J. Virol. (1993) 67(12):7161). The I-kappaBs and Cactus (also
containing ankyrin repeats) inhibit activators through differential interactions with the
Rel-homology domain. The gene family includes proto-oncogenes, thus broadly implicating
I-kappaB in the control of both normal gene expression and the aberrant gene expression that
makes cells cancerous (Nolan and Baltimore, Curr. Opin. Genet. Dev. (1992) 2(2):211-220).
In the case of rel/NF-kappaB and pp40/I-kappaBβ, both the ankyrin repeats and the
carboxy-terminal domain are required for inhibiting DNA-binding activity and direct
association of pp40/I-kappaBβ with rel/NF-kappaB protein. The ankyrin repeats and the
carboxy-terminal domain of pp40/I-kappaBβ form a structure that associates with the rel
homology domain to inhibit DNA binding activity (Inoue et al., PNAS USA (1992) 89:4333).

The 4 ankyrin repeats in the amino terminus of the transcription factor subunit
GABPβ are required for its interaction with the GABPα subunit to form a functional high
affinity DNA-binding protein. These repeats can be crosslinked to DNA when GABP is
bound to its target sequence (Thompson et al., Science (1991) 253:762; LaMarco et al.,
Science (1991) 255:789).

Myotrophin, a 12.5 kDa protein having a key role in the initiation of cardiac
hypertrophy, comprises ankyrin repeats. The ankyrin repeats are characterized by a
hairpin-like protruding tip followed by a helix-turn-helix motif. The V-shaped helix-turn-helix
motifs of the repeats stack sequentially in bundles and are stabilized by compact hydrophobic
cores, whereas the protruding tips are less ordered.

d) ATPases Associated with Various Cellular Activities (AAA). SEQ ID NOS: 63,
116, 134, 136, 151, 384, and 404 represent polynucleotides encoding novel members of the
"ATPases Associated with diverse cellular Activities" (AAA) protein family. The AAA
protein family is composed of a large number of ATPases that share a conserved region of
about 220 amino acids that contains an ATP-binding site (Froehlich et al., J. Cell Biol.
(1991) 114:443; Erdmann et al., Cell (1991) 64:499; Peters et al., EMBO J. (1990) 9:1757;
Kunau et al., Biochimie (1993) 75:209-224; Confalonieri et al., BioEssays (1995) 77:639;
http://yeamob.pci.chemie.uni-tuebingen.de/AAA/Description.html). The proteins that
belong to this family contain either one or two AAA domains.

Proteins containing two AAA domains include: 1) Mammalian and Drosophila NSF
(N-ethylmaleimide-sensitive fusion protein) and the fungal homolog, SEC18, which are
involved in intracellular transport between the endoplasmic reticulum and Golgi, as well as
between different Golgi cisternae; 2) Mammalian transitional endoplasmic reticulum
ATPase (previously known as p97 or VCP), which is involved in the transfer of membranes
from the endoplasmic reticulum to the Golgi apparatus. This ATPase forms a ring-shaped
homooligomer composed of six subunits. The yeast homolog, CDC48, plays a role in
spindle pole proliferation; 3) Yeast protein PAS1, essential for peroxisome assembly, and the
related protein PAS1 from Pichia pastoris; 4) Yeast protein AFG2; 5) Sulfolobus
acidocaldarius protein SAV and Halobacterium salinarium cdcH, which may be part of a
transduction pathway connecting light to cell division.

Proteins containing a single AAA domain include: 1) Escherichia coli and other
bacteria ftsH (or hflB) protein. FtsH is an ATP-dependent zinc metallopeptidase that
degrades the heat-shock sigma-32 factor, and is an integral membrane protein with a large
cytoplasmic C-terminal domain that contains both the AAA and the protease domains; 2)
Yeast protein YME1, a protein important for maintaining the integrity of the mitochondrial
compartment. YME1 is also a zinc-dependent protease; 3) Yeast protein AFG3 (or YTA10).
This protein also contains an AAA domain followed by a zinc-dependent protease domain;
4) Subunits from the regulatory complex of the 26S proteasome (Hilt et al., Trends Biochem.
Sci. (1996) 21:96), which is involved in the ATP-dependent degradation of ubiquitinated
proteins, which subunits include: a) Mammalian subunit 4 and homologs in other higher
eukaryotes, in yeast (gene YTA5) and fission yeast (gene mts2); b) Mammalian subunit 6
(TBP7) and homologs in other higher eukaryotes and in yeast (gene YTA2); c) Mammalian
subunit 7 (MSS1) and homologs in other higher eukaryotes and in yeast (gene CIM5 or
YTA3); d) Mammalian subunit 8 (P45) and homologs in other higher eukaryotes and in
yeast (SUG1 or CIM3 or TBY1) and fission yeast (gene let1); e) Other probable subunits
include human TBP1, which influences HIV gene expression by interacting with the virus
tat transactivator protein, and yeast YTA1 and YTA6; 5) Yeast protein BCS1, a
mitochondrial protein essential for the expression of the Rieske iron-sulfur protein; 6) Yeast
protein MSP1, a protein involved in intramitochondrial sorting of proteins; 7) Yeast protein
PAS8, and the corresponding proteins PAS5 from Pichia pastoris and PAY4 from Yarrowia
lipolytica; 8) Mouse protein SKD1 and its fission yeast homolog (SpAC2G11.06); 9)
Caenorhabditis elegans meiotic spindle formation protein mei-1; 10) Yeast protein SAP1;
11) Yeast protein YTA7; and 12) Mycobacterium leprae hypothetical protein A2126A.

In general, the AAA domains in these proteins act as ATP-dependent protein
clamps (Confalonieri et al., BioEssays (1995) 77:639). In addition to the ATP-binding 'A' and
'B' motifs, which are located in the N-terminal half of this domain, there is a highly
conserved region located in the central part of the domain which was used in the
development of the signature pattern. The consensus pattern is:
[LIVMT]-x-[LIVMT]-[LIVMF]-x-[GATMC]-[ST]- ...

e) Basic Region Plus Leucine Zipper Transcription Factors. SEQ ID NO:374
corresponds to a polynucleotide encoding a novel member of the family of basic region plus
leucine zipper transcription factors. The bZIP superfamily (Hurst, Protein Prof. (1995)
2:105; and Ellenberger, Curr. Opin. Struct. Biol. (1994) 4:12) of eukaryotic DNA-binding
transcription factors encompasses proteins that contain a basic region mediating
sequence-specific DNA-binding followed by a leucine zipper required for dimerization.
Members of the family include transcription factor AP-1, which binds selectively to enhancer
elements in the cis control regions of SV40 and metallothionein IIA. AP-1, also known as
c-jun, is the cellular homolog of the avian sarcoma virus 17 (ASV17) oncogene v-jun.

Other members of this protein family include jun-B and jun-D, probable
transcription factors that are highly similar to jun/AP-1; the fos protein, a proto-oncogene
that forms a non-covalent dimer with c-jun; the fos-related proteins fra-1 and fos B; and
mammalian cAMP response element (CRE) binding proteins CREB, CREM, ATF-1,
ATF-3, ATF-4, ATF-5, ATF-6 and LRF-1. The consensus pattern for this protein family is:
[KR]-x(1,3)-[RKSAQ]-N- ...

f) Bromodomain. SEQ ID NO:97 corresponds to a polynucleotide encoding a
polypeptide having a bromodomain region (Haynes et al., 1992, Nucleic Acids Res.
20:2693-2603, Tamkun et al., 1992, Cell 68:561-572, and Tamkun, 1995, Curr. Opin. Genet.
Dev. 5:473-477), which is a conserved region of about 70 amino acids found in the
following proteins: 1) Higher eukaryotes transcription initiation factor TFIID 250 Kd
subunit (TBP-associated factor p250) (gene CCG1); P250 is associated with the TFIID
TATA-box binding protein and seems essential for progression of the G1 phase of the cell
cycle; 2) Human RING3, a protein of unknown function encoded in the MHC class II locus;
3) Mammalian CREB-binding protein (CBP), which mediates cAMP-gene regulation by
binding specifically to phosphorylated CREB protein; 4) Mammalian homologs of brahma,
including three brahma-like human proteins: SNF2a (hBRM), SNF2b, and BRG1; 5) Human
BS69, a protein that binds to adenovirus E1A and inhibits E1A transactivation; 6) Human
peregrin (or Br140).

The bromodomain is thought to be involved in protein-protein interactions and may
be important for the assembly or activity of multicomponent complexes involved in
transcriptional activation. The consensus pattern, which spans a major part of the
bromodomain, is: [STANVF]-x(2)-F-x(4)-[DNS]-x(5,7)-[DENQTF]-Y-[HFY]-x(2)-
[LIVMFY]-x(3)- ... -[FY].

g) EF-Hand. SEQ ID NOS:136, 242, and 379 correspond to polynucleotides
encoding a novel protein in the family of EF-hand proteins. Many calcium-binding proteins
belong to the same evolutionary family and share a type of calcium-binding domain known
as the EF-hand (Kawasaki et al., Protein Prof. (1995) 2:305-490). This type of domain
consists of a twelve residue loop flanked on both sides by a twelve residue alpha-helical 
domain. In an EF-hand loop the calcium ion is coordinated in a pentagonal bipyramidal 
configuration. The six residues involved in the binding are in positions 1, 3, 5, 7, 9 and 12; 
these residues are denoted by X, Y, Z, -Y, -X and -Z. The invariant Glu or Asp at position 
12 provides two oxygens for liganding Ca (bidentate ligand). 
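
The loop numbering above can be made concrete with a short sketch that labels the six coordinating positions of a 12-residue loop and checks for the bidentate Glu or Asp at position 12. The function names and the example loop string are illustrative assumptions, not sequences taken from SEQ ID NOS: 136, 242, or 379.

```python
# Ligand positions within the 12-residue EF-hand loop (1-based) and their labels.
LIGAND_POSITIONS = {1: "X", 3: "Y", 5: "Z", 7: "-Y", 9: "-X", 12: "-Z"}


def ef_hand_ligands(loop):
    """Return (position, label, residue) for each coordinating position."""
    if len(loop) != 12:
        raise ValueError("an EF-hand loop is expected to have 12 residues")
    return [(pos, label, loop[pos - 1]) for pos, label in LIGAND_POSITIONS.items()]


def has_bidentate_ligand(loop):
    """True if position 12 carries Glu (E) or Asp (D)."""
    return len(loop) == 12 and loop[11] in "DE"


if __name__ == "__main__":
    example_loop = "DKDGDGTITTKE"  # illustrative 12-residue loop
    for pos, label, residue in ef_hand_ligands(example_loop):
        print(pos, label, residue)
    print(has_bidentate_ligand(example_loop))
```

Positions not listed in `LIGAND_POSITIONS` are ignored here; the sketch only reports which residues sit at the coordinating positions and does not assess whether they are chemically suitable ligands.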

Proteins known to contain EF-hand regions include: Calmodulin (Ca=4, except in
yeast where Ca=3) ("Ca=" indicates approximate number of EF-hand regions);
diacylglycerol kinase (EC 2.7.1.107) (DGK) (Ca=2); FAD-dependent glycerol-3-phosphate
dehydrogenase (EC 1.1.99.5) from mammals (Ca=1); guanylate cyclase activating
protein (GCAP) (Ca=3); MIF related proteins 8 (MRP-8 or CFAG) and 14 (MRP-14)
(Ca=2); myosin regulatory light chains (Ca=1); oncomodulin (Ca=2); osteonectin (basement
membrane protein BM-40) (SPARC); and proteins that contain an "osteonectin" domain
(QR1, matrix glycoprotein SC1).

The consensus pattern includes the complete EF-hand loop as well as the first residue
which follows the loop and which seems to always be hydrophobic.

Consensus pattern: D-x-[DNS]-{ILVFYW}-[DENSTG]-[DNQGHRK]-{GP}-
[LIVMC]-[DENQSTAGC]-x(2)-[DE]-[LIVMFYW]
h) Eukaryotic Aspartyl Proteases. SEQ ID NO:308 corresponds to a gene encoding a
novel eukaryotic aspartyl protease. Aspartyl proteases, known as acid proteases (EC
3.4.23.-), are a widely distributed family of proteolytic enzymes (Foltmann B., Essays
Biochem. (1981) 77:52; Davies D.R., Annu. Rev. Biophys. Chem. (1990) 19:189; Rao
J.K.M., et al., Biochemistry (1991) 30:4663) known to exist in vertebrates, fungi, plants,
retroviruses and some plant viruses. Aspartate proteases of eukaryotes are monomeric
enzymes which consist of two domains. Each domain contains an active site centered on a
catalytic aspartyl residue. The two domains most probably evolved from the duplication of
an ancestral gene encoding a primordial domain. Currently known eukaryotic aspartyl
proteases include: 1) Vertebrate gastric pepsins A and C (also known as gastricsin); 2)
Vertebrate chymosin (rennin), involved in digestion and used for making cheese; 3)
Vertebrate lysosomal cathepsins D (EC 3.4.23.5) and E (EC 3.4.23.34); 4) Mammalian renin
(EC 3.4.23.15) whose function is to generate angiotensin I from angiotensinogen in the
plasma; 5) Fungal proteases such as aspergillopepsin A (EC 3.4.23.18), candidapepsin (EC
3.4.23.24), mucoropepsin (EC 3.4.23.23) (mucor rennin), endothiapepsin (EC 3.4.23.22),
polyporopepsin (EC 3.4.23.29), and rhizopuspepsin (EC 3.4.23.21); 6) Yeast
saccharopepsin (EC 3.4.23.25) (proteinase A) (gene PEP4). PEP4 is implicated in
posttranslational regulation of vacuolar hydrolases; 7) Yeast barrierpepsin (EC 3.4.23.35)
(gene BAR1), a protease that cleaves alpha-factor and thus acts as an antagonist of the
mating pheromone; and 8) Fission yeast sxa1, which is involved in degrading or processing
the mating pheromones.

Most retroviruses and some plant viruses, such as badnaviruses, encode an
aspartyl protease which is a homodimer of a chain of about 95 to 125 amino acids. In most
retroviruses, the protease is encoded as a segment of a polyprotein which is cleaved during
the maturation process of the virus. It is generally part of the pol polyprotein and, more
rarely, of the gag polyprotein. Because the sequence around the two aspartates of eukaryotic
aspartyl proteases and around the single active site of the viral proteases is conserved, a
single signature pattern can be used to identify members of both groups of proteases. The
consensus pattern is: [LIVMFGAC]-[LIVMTAD ...
[STAPDENQ]-x-[LIVMFSTNC]-x-[LIVMFGTA], where D is the active site residue.

i) GATA Family of Transcription Factors. SEQ ID NO:213 corresponds to a novel
member of the GATA family of transcription factors. The GATA family of transcription
factors are proteins that bind to DNA sites with the consensus sequence (A/T)GATA(A/G),
found within the regulatory region of a number of genes. Proteins currently known to belong
to this family are: 1) GATA-1 (Trainor, C.D., et al., Nature (1990) 343:92) (also known as
Eryf1, GF-1 or NF-E1), which binds to the GATA region of globin genes and other genes
expressed in erythroid cells. It is a transcriptional activator which probably serves as a
general 'switch' factor for erythroid development; 2) GATA-2 (Lee, M.E., et al., J. Biol.
Chem. (1991) 266:16188), a transcriptional activator which regulates endothelin-1 gene
expression in endothelial cells; 3) GATA-3 (Ho, I.-C., et al., EMBO J. (1991) 10:1187), a
transcriptional activator which binds to the enhancer of the T-cell receptor alpha and delta
genes; 4) GATA-4 (Spieth, J., et al., Mol. Cell. Biol. (1991) 77:4651), a transcriptional
activator expressed in endodermally derived tissues and heart; 5) Drosophila protein pannier
(or DGATAa) (gene pnr) which acts as a repressor of the achaete-scute complex (as-c); 6)
Bombyx mori BCFI (Drevet, J.R., et al., J. Biol. Chem. (1994) 269:10660), which regulates
the expression of chorion genes; 7) Caenorhabditis elegans elt-1 and elt-2, transcriptional
activators of genes containing the GATA region, including vitellogenin genes (Hawkins,
M.G., et al., J. Biol. Chem. (1995) 270:14666); 8) Ustilago maydis urbs1 (Voisard, C.P.O.,
et al., Mol. Cell. Biol. (1993) 75:7091), a protein involved in the repression of the
biosynthesis of siderophores; 9) Fission yeast protein GAF2.
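
The DNA consensus site (A/T)GATA(A/G) mentioned above lends itself to a simple scan. The sketch below searches both strands of a DNA string for that six-base site and reports 1-based start positions on the listed strand. It is an illustration only: the function name is hypothetical, the input is assumed to contain only unambiguous A, C, G, T bases, and a match says nothing about whether a site is bound in vivo.

```python
import re

# Lookahead so that overlapping occurrences of the site are all reported.
SITE = re.compile(r"(?=([AT]GATA[AG]))")
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def find_gata_sites(dna):
    """Return sorted (start, strand, site) tuples for (A/T)GATA(A/G) sites.

    `start` is the 1-based position of the leftmost base of the site on
    the listed strand; strand is "+" for the listed strand and "-" for
    the complementary strand.
    """
    dna = dna.upper()
    length = len(dna)
    hits = [(m.start() + 1, "+", m.group(1)) for m in SITE.finditer(dna)]
    reverse = dna.translate(COMPLEMENT)[::-1]
    for m in SITE.finditer(reverse):
        # Convert a position on the reverse complement back to the listed strand.
        start = length - (m.start() + len(m.group(1))) + 1
        hits.append((start, "-", m.group(1)))
    return sorted(hits)
```

For instance, the listed-strand sequence TTATCT contains no site itself, but its complementary strand reads AGATAA, so it is reported as a "-" strand hit starting at position 1.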

All these transcription factors contain a pair of highly similar 'zinc finger' type
domains with the consensus sequence C-x2-C-x17-C-x2-C. Some other proteins contain a
single zinc finger motif highly related to those of the GATA transcription factors. These
proteins are: 1) Drosophila box A-binding factor (ABF) (also known as protein serpent
(gene srp)) which may function as a transcriptional activator protein and may play a key role
in the organogenesis of the fat body; 2) Emericella nidulans areA (Arst, H.N., Jr., et al.,
Trends Genet. (1989) 5:291), a transcriptional activator which mediates nitrogen metabolite
repression; 3) Neurospora crassa nit-2 (Fu, Y.-H., et al., Mol. Cell. Biol. (1990) 10:1056), a
transcriptional activator which turns on the expression of genes coding for enzymes required
for the use of a variety of secondary nitrogen sources, during conditions of nitrogen
limitation; 4) Neurospora crassa white collar proteins 1 and 2 (WC-1 and WC-2), which
control expression of light-regulated genes; 5) Saccharomyces cerevisiae DAL81 (or
UGA43), a negative nitrogen regulatory protein; 6) Saccharomyces cerevisiae GLN3, a
positive nitrogen regulatory protein; 7) Saccharomyces cerevisiae GAT1; 8) Saccharomyces
cerevisiae GZF3.

The consensus pattern for the GATA family is: C-x-[DN]-C-x(4,5)-[ST]-x(2)-W-
[HR]-[RK]-x(3)-[GN]-x(3,4)-C-N-[AS]-C, where the four C's are zinc ligands.
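
Consensus patterns written in this notation can be turned into regular expressions for scanning a translated sequence. The following is a minimal sketch, not the profile search method used in the specification. It assumes the usual reading of the notation: `x` is any residue, `[...]` lists allowed residues, `{...}` lists excluded residues, and `(n)` or `(n,m)` gives a repeat count. Anchors and other extensions are not handled, and the function name and test peptide are hypothetical.

```python
import re

# One pattern element: a residue specification with an optional repeat count.
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Convert a dash-separated consensus pattern into a regex string."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, low, high = match.groups()
        if residue == "x":
            piece = "."                            # any residue
        elif residue.startswith("{"):
            piece = "[^" + residue[1:-1] + "]"     # excluded residues
        else:
            piece = residue                        # literal or [allowed set]
        if low is not None:
            piece += "{" + low + ("," + high if high else "") + "}"
        parts.append(piece)
    return "".join(parts)


GATA_PATTERN = "C-x-[DN]-C-x(4,5)-[ST]-x(2)-W-[HR]-[RK]-x(3)-[GN]-x(3,4)-C-N-[AS]-C"

if __name__ == "__main__":
    regex = re.compile(prosite_to_regex(GATA_PATTERN))
    peptide = "MKCTNCQTTTTTLWRRNASGDPVCNACGL"  # made-up peptide for illustration
    hit = regex.search(peptide)
    print(hit.start() + 1, hit.group(0))
```

A regular expression of this kind only tests the short signature; a profile search, as used to generate Table 3, scores the whole domain and can detect members that the signature alone would miss.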
j) G-Protein Alpha Subunit. SEQ ID NO:367 corresponds to a gene encoding a novel
polypeptide of the G-protein alpha subunit family. Guanine nucleotide binding proteins
(G-proteins) are a family of membrane-associated proteins that couple extracellularly-activated
integral-membrane receptors to intracellular effectors, such as ion channels and enzymes that
vary the concentration of second messenger molecules. G-proteins are composed of 3
subunits (alpha, beta and gamma) which, in the resting state, associate as a trimer at the inner
face of the plasma membrane. The alpha subunit has a molecule of guanosine diphosphate
(GDP) bound to it. Stimulation of the G-protein by an activated receptor leads to its
exchange for GTP (guanosine triphosphate). This results in the separation of the alpha from
the beta and gamma subunits, which always remain tightly associated as a dimer. Both the
alpha and beta-gamma subunits are then able to interact with effectors, either individually or
in a cooperative manner. The intrinsic GTPase activity of the alpha subunit hydrolyses the
bound GTP to GDP. This returns the alpha subunit to its inactive conformation and allows it
to reassociate with the beta-gamma subunit, thus restoring the system to its resting state.

G-protein alpha subunits are 350-400 amino acids in length and have molecular
weights in the range 40-45 kDa. Seventeen distinct types of alpha subunit have been
identified in mammals. These fall into 4 main groups on the basis of both sequence
similarity and function: alpha-s, alpha-q, alpha-i and alpha-12 (Simon et al., Science (1993)
252:802). Many alpha subunits are substrates for ADP-ribosylation by cholera or pertussis
toxins. They are often N-terminally acylated, usually with myristate and/or palmitate,
and these fatty acid modifications are probably important for membrane association and
high-affinity interactions with other proteins. The atomic structure of the alpha subunit of
the G-protein involved in mammalian vision, transducin, has been elucidated in both GTP-
and GDP-bound forms, and shows considerable similarity in both primary and tertiary
structure in the nucleotide-binding regions to other guanine nucleotide binding proteins, such
as p21-ras and EF-Tu.

k) Phorbol Esters/Diacylglycerol Binding. SEQ ID NOS:188 and 251 represent
polynucleotides encoding a protein belonging to the family including phorbol
esters/diacylglycerol binding proteins. Diacylglycerol (DAG) is an important second
messenger. Phorbol esters (PE) are analogues of DAG and potent tumor promoters that
cause a variety of physiological changes when administered to both cells and tissues. DAG
activates a family of serine/threonine protein kinases, collectively known as protein kinase C
(PKC) (Azzi et al., Eur. J. Biochem. (1992) 205:547). Phorbol esters can directly stimulate
PKC. The N-terminal region of PKC, known as C1, has been shown (Ono et al., Proc. Natl.
Acad. Sci. USA (1989) 86:4868) to bind PE and DAG in a phospholipid and zinc-dependent
fashion. The C1 region contains one or two copies (depending on the isozyme of PKC) of a
cysteine-rich domain about 50 amino-acid residues long and essential for DAG/PE-binding.
Such a domain has also been found in, for example, the following proteins.

(1) Diacylglycerol kinase (EC 2.7.1.107) (DGK) (Sakane et al., Nature (1990)
344:345), the enzyme that converts DAG into phosphatidate. It contains two copies of the
DAG/PE-binding domain in its N-terminal section. At least five different forms of DGK are
known in mammals; and

(2) N-chimaerin, a brain specific protein which shows sequence similarities with the
BCR protein at its C-terminal part and contains a single copy of the DAG/PE-binding
domain in its N-terminal part. It has been shown (Ahmed et al., Biochem. J. (1990) 272:767,
and Ahmed et al., Biochem. J. (1991) 250:233) to be able to bind phorbol esters.

The DAG/PE-binding domain binds two zinc ions; the ligands of these metal ions are
probably the six cysteines and two histidines that are conserved in this domain. The
signature pattern completely spans the DAG/PE domain. The consensus pattern is: H-x-
[LIVMFYW]-x(8,11)-C-x(2)-C-x(3)-...-C. All the C and H are probably involved in binding
zinc.

l) Protein Kinase. SEQ ID NOS:202, 315, 367, and 397 represent polynucleotides
encoding protein kinases. Protein kinases catalyze phosphorylation of proteins in a variety of
pathways, and are implicated in cancer. Eukaryotic protein kinases (Hanks S.K., et al.,
FASEB J. (1995) 9:576; Hunter T., Meth. Enzymol. (1991) 200:3; Hanks S.K., et al., Meth.
Enzymol. (1991) 200:38; Hanks S.K., Curr. Opin. Struct. Biol. (1991) 1:369; Hanks S.K., et
al., Science (1988) 241:42) are enzymes that belong to a very extensive family of proteins
which share a conserved catalytic core common to both serine/threonine and tyrosine protein
kinases. There are a number of conserved regions in the catalytic domain of protein kinases.
Two of the conserved regions are the basis for the signature pattern in the protein kinase
profile. The first region, which is located in the N-terminal extremity of the catalytic
domain, is a glycine-rich stretch of residues in the vicinity of a lysine residue, which has
been shown to be involved in ATP binding. The second region, which is located in the
central part of the catalytic domain, contains a conserved aspartic acid residue which is
important for the catalytic activity of the enzyme (Knighton D.R., et al., Science (1991)
253:407). The protein kinase profile includes two signature patterns for this second region:
one specific for serine/threonine kinases and the other for tyrosine kinases. A third profile is
based on the alignment in (Hanks S.K., et al., FASEB J. (1995) 9:576) and covers the entire
catalytic domain. The consensus patterns are as follows:

1) Consensus pattern: [LIV]-G-{P}-G-{P}-[FYWMGSTNH]-[SGA]-{PW}-
[LIVCAT]-{PD}-x-[GSTACLIV...]-...-
[LIVMFAGCKR]-K, where K binds ATP. The majority of known protein kinases are
detected by this pattern. Protein kinases that are not detected by this consensus include
viral kinases, which are quite divergent in this region and are completely missed by this
pattern.

2) Consensus pattern: [LIVMFYC]-x-[HY]-x-D-[LIVMFY]-K-x(2)-N-
[LIVMFYCT](3), where D is an active site residue. This consensus sequence identifies most
serine/threonine-specific protein kinases with only 10 exceptions. Half of the exceptions are
viral kinases, while the other exceptions include Epstein-Barr virus BGLF4 and Drosophila
ninaC, which have Ser and Arg, respectively, instead of the conserved Lys. These latter two
protein kinases are detected by the tyrosine kinase specific pattern described below.

3) Consensus pattern: [LIVMFYC]-x-[HY]-x-D-[LIVMFY]-[RSTAC]-x(2)-N-
[LIVMFYC], where D is an active site residue. All tyrosine-specific protein kinases are
detected by this consensus pattern, with the exception of human ERBB3 and mouse blk.
This pattern also detects most bacterial aminoglycoside phosphotransferases (Benner S.,
Nature (1987) 329:21; Kirby R., J. Mol. Evol. (1992) 50:489) and herpesviruses ganciclovir
kinases (Littler E., et al., Nature (1992) 358:160), which are structurally and evolutionarily
related to protein kinases.

The protein kinase profile also detects receptor guanylate cyclases and
2-5A-dependent ribonucleases. Sequence similarities between these two families and the
eukaryotic protein kinase family have been noticed previously. The profile also detects
Arabidopsis thaliana kinase-like protein TMKL1 which seems to have lost its catalytic
activity.

If a protein analyzed includes two of the above protein kinase signatures, the
probability of it being a protein kinase is close to 100%. Eukaryotic-type protein kinases
have also been found in prokaryotes such as Myxococcus xanthus (Munoz-Dorado J., et al.,
Cell (1991) 57:995) and Yersinia pseudotuberculosis. The patterns shown above have been
updated since their publication in (Bairoch A., et al., Nature (1988) 331:22).

m) Protein Phosphatase 2C. SEQ ID NO:256 corresponds to a polynucleotide
encoding a novel protein phosphatase 2C (PP2C), which is one of the four major classes of
mammalian serine/threonine specific protein phosphatases. PP2C (Wenk et al., FEBS Lett.
(1992) 297:135) is a monomeric enzyme of about 42 Kd which shows broad substrate
specificity and is dependent on divalent cations (mainly manganese and magnesium) for its
activity. Three isozymes are currently known in mammals: PP2C-alpha, -beta and -gamma.

n) Protein Tyrosine Phosphatase. SEQ ID NO:382 represents a polynucleotide
encoding a protein tyrosine phosphatase. Tyrosine specific protein phosphatases (EC 3.1.3.48)
(PTPase) (Fischer et al., Science (1991) 255:401; Charbonneau et al., Annu. Rev. Cell Biol.
(1992) 5:463; Trowbridge, J. Biol. Chem. (1991) 266:23517; Tonks et al., Trends Biochem.
Sci. (1989) 74:497; and Hunter, Cell (1989) 55:1013) catalyze the removal of a phosphate
group attached to a tyrosine residue. These enzymes are very important in the control of cell
growth, proliferation, differentiation and transformation. Multiple forms of PTPase have
been characterized and can be classified into two categories: soluble PTPases and
transmembrane receptor proteins that contain PTPase domain(s).

Soluble PTPases include PTPN3 (H1) and PTPN4 (MEG), enzymes that contain an
N-terminal band 4.1-like domain and could act at junctions between the membrane and
cytoskeleton; PTPN6 (PTP-1C; HCP; SHP) and PTPN11 (PTP-2C; SH-PTP3; Syp),
enzymes that contain two copies of the SH2 domain at their N-terminal extremity.

Dual specificity PTPases include DUSP1 (PTPN10; MAP kinase phosphatase-1;
MKP-1) which dephosphorylates MAP kinase on both Thr-183 and Tyr-185; and DUSP2
(PAC-1), a nuclear enzyme that dephosphorylates MAP kinases ERK1 and ERK2 on both
Thr and Tyr residues.

Structurally, all known receptor PTPases are made up of a variable length
extracellular domain, followed by a transmembrane region and a C-terminal catalytic
cytoplasmic domain. Some of the receptor PTPases contain fibronectin type III (FN-III)
repeats, immunoglobulin-like domains, MAM domains or carbonic anhydrase-like domains
in their extracellular region. The cytoplasmic region generally contains two copies of the
PTPase domain. The first seems to have enzymatic activity, while the second is inactive but
seems to affect substrate specificity of the first. In these domains, the catalytic cysteine is
generally conserved but some other, presumably important, residues are not.

PTPase domains consist of about 300 amino acids. There are two conserved
cysteines and the second one has been shown to be absolutely required for activity.
Furthermore, a number of conserved residues in its immediate vicinity have also been shown
to be important. The consensus pattern for PTPases is: [LIVMF]-H-C-x(2)-G-x(3)-[STC]-
[STAGP]-x-[LIVMFY]; C is the active site residue.
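
The consensus patterns in this section are written in a PROSITE-style notation: x is any residue, [..] lists allowed residues, {..} lists excluded residues, and (n) or (n,m) gives a repeat count. As an illustration only, the following Python sketch translates such a pattern into a regular expression and scans a peptide with it. The function name prosite_to_regex and the test peptide are hypothetical and are not part of the disclosure; the sketch assumes the simple grammar just described and does not handle anchors or other extensions.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE-style pattern into a Python regex string.

    Supported elements (an assumption of this sketch): single residues,
    x, [allowed], {excluded}, each with an optional (n) or (n,m) repeat.
    """
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        # Separate the residue specification from an optional repeat count.
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core == "x":
            token = "."                      # any residue
        elif core.startswith("[") and core.endswith("]"):
            token = core                     # allowed residues
        elif core.startswith("{") and core.endswith("}"):
            token = "[^" + core[1:-1] + "]"  # excluded residues
        elif len(core) == 1 and core.isalpha():
            token = core                     # one fixed residue
        else:
            raise ValueError(f"unsupported pattern element: {element!r}")
        if low is not None:
            token += "{" + low + ("," + high if high else "") + "}"
        parts.append(token)
    return "".join(parts)


# PTPase consensus pattern as given in the text above.
PTPASE = "[LIVMF]-H-C-x(2)-G-x(3)-[STC]-[STAGP]-x-[LIVMFY]"
ptpase_regex = re.compile(prosite_to_regex(PTPASE))

# Hypothetical peptide used only to exercise the pattern.
hit = ptpase_regex.search("AAVHCSAGVGRSGTFAA")
if hit:
    print("match at residue", hit.start() + 1, hit.group())
```

A regular expression reports only exact matches to the pattern; it does not reproduce profile-based detection, which scores similarity across the whole domain.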

o) SH3 Domain. SEQ ID NO:306 and 386 represent polynucleotides encoding SH3
domain proteins. The Src homology 3 (SH3) domain is a small protein domain of about 60
amino acid residues first identified as a conserved sequence in the non-catalytic part of
several cytoplasmic protein tyrosine kinases (e.g. Src, Abl, Lck) (Mayer et al., Nature (1988)
332:272). The domain has also been found in a variety of intracellular or
membrane-associated proteins (Musacchio et al., FEBS Lett. (1992) 507:55; Pawson et al.,
Curr. Biol. (1993) 5:434; Mayer et al., Trends Cell Biol. (1993) 5:8; and Pawson et al.,
Nature (1995) 575:573).

The SH3 domain has a characteristic fold that consists of five or six beta-strands
arranged as two tightly packed anti-parallel beta sheets. The linker regions may contain
short helices (Kuriyan et al., Curr. Opin. Struct. Biol. (1993) 5:828). It is believed that SH3
domain-containing proteins mediate assembly of specific protein complexes via binding to
proline-rich peptides (Morton et al., Curr. Biol. (1994) 4:615). In general, SH3 domains are
found as single copies in a given protein, but there is a significant number of proteins with
two SH3 domains and a few with 3 or 4 copies.

SH3 domains have been identified in, for example, protein tyrosine kinases, such as
the Src, Abl, Btk, Csk and ZAP70 families of kinases; mammalian
phosphatidylinositol-specific phospholipase C-gamma-1 and -2; mammalian
phosphatidylinositol 3-kinase regulatory p85 subunit; mammalian Ras GTPase-activating
protein (GAP); mammalian Vav oncoprotein, a guanine nucleotide exchange factor of the
CDC24 family; Drosophila lethal(1)discs large-1 tumor suppressor protein (gene Dlg1);
mammalian tight junction protein ZO-1; vertebrate erythrocyte membrane protein p55;
Caenorhabditis elegans protein lin-2; rat protein CASK; and mammalian synaptic proteins
SAP90/PSD-95, CHAPSYN-110/PSD-93, SAP97/DLG1 and SAP102. Novel SH3-domain
containing polypeptides will facilitate elucidation of the role of such proteins in important
biological pathways, such as ras activation.

p) Trypsin. SEQ ID NO:169 corresponds to a novel serine protease of the trypsin
family. The catalytic activity of the serine proteases from the trypsin family is provided by a
charge relay system involving an aspartic acid residue hydrogen-bonded to a histidine, which
itself is hydrogen-bonded to a serine. The sequences in the vicinity of the active site serine
and histidine residues are well conserved in this family of proteases (Brenner S., Nature
(1988) 334:528). Proteases known to belong to the trypsin family include: 1) Acrosin; 2)
Blood coagulation factors VII, IX, X, XI and XII, thrombin, plasminogen, and protein C; 3)
Cathepsin G; 4) Chymotrypsins; 5) Complement components C1r, C1s, C2, and complement
factors B, D and I; 6) Complement-activating component of RA-reactive factor; 7) Cytotoxic
cell proteases (granzymes A to H); 8) Duodenase I; 9) Elastases 1, 2, 3A, 3B (protease E),
and leukocyte (medullasin); 10) Enterokinase (EC 3.4.21.9) (enteropeptidase); 11) Hepatocyte
growth factor activator; 12) Hepsin; 13) Glandular (tissue) kallikreins (including
EGF-binding protein types A, B, and C, NGF-gamma chain, gamma-renin, prostate specific
antigen (PSA) and tonin); 14) Plasma kallikrein; 15) Mast cell proteases (MCP) 1 (chymase)
to 8; 16) Myeloblastin (proteinase 3) (Wegener's autoantigen); 17) Plasminogen activators
(urokinase-type, and tissue-type); 18) Trypsins I, II, III, and IV; 19) Tryptases; 20) Snake
venom proteases such as ancrod, batroxobin, cerastobin, flavoxobin, and protein C activator;
21) Collagenase from common cattle grub and collagenolytic protease from Atlantic sand
fiddler crab; 22) Apolipoprotein(a); 23) Blood fluke cercarial protease; 24) Drosophila
trypsin like proteases: alpha, easter, snake-locus; 25) Drosophila protease stubble (gene sb);
and 26) Major mite fecal allergen Der p III. All the above proteins belong to family S1 in
the classification of peptidases (Rawlings N.D., et al., Meth. Enzymol. (1994) 244:19;
http://www.expasy.ch/cgi-bin/lists?peptidas.txt) and originate from eukaryotic species. It
should be noted that bacterial proteases that belong to family S2A are similar enough in the
regions of the active site residues that they can be picked up by the same patterns.

The consensus patterns for this trypsin protein family are: 1) [LIVM]-[ST]-A-
[STAG]-H-C, where H is the active site residue. All sequences known to belong to this class
are detected by the pattern, except for complement components C1r and C1s, pig plasminogen,
bovine protein C, rodent urokinase, ancrod, gyroxin and two insect trypsins; 2)
[DNSTAGC]-[GSTAPIMVQH]-x(2)-G-[DE]-S-G-[GS]-[SAPHV]-[LIVMFYWH]-
[LIVMFYSTANQH], where S is the active site residue. All sequences known to belong to
this family are detected by the above consensus sequences, except for 18 different proteases
which have lost the first conserved glycine. If a protein includes both the serine and the
histidine active site signatures, the probability of it being a trypsin family serine protease is
100%.

q) WD Domain. G-Beta Repeats. SEQ ID NOS:188 and 335 represent novel
members of the WD domain/G-beta repeat family. Beta-transducin (G-beta) is one of the
three subunits (alpha, beta, and gamma) of the guanine nucleotide-binding proteins (G
proteins) which act as intermediaries in the transduction of signals generated by
transmembrane receptors (Gilman, Annu. Rev. Biochem. (1987) 56:615). The alpha subunit
binds to and hydrolyzes GTP; the functions of the beta and gamma subunits are less clear but
they seem to be required for the replacement of GDP by GTP as well as for membrane
anchoring and receptor recognition.

In higher eukaryotes, G-beta exists as a small multigene family of highly conserved 
proteins of about 340 amino acid residues. Structurally, G-beta consists of eight tandem 
repeats of about 40 residues, each containing a central Trp-Asp motif (this type of repeat is 
sometimes called a WD-40 repeat). Such a repetitive segment has been shown to exist in a 
number of other proteins including: human LIS1, a neuronal protein involved in type-1 
lissencephaly; and mammalian coatomer beta' subunit (beta'-COP), a component of a
cytosolic protein complex that reversibly associates with Golgi membranes to form vesicles 
that mediate biosynthetic protein transport. 

The consensus pattern for the WD domain/G-Beta repeat family is: [LIVMSTAC]-
[LIVMFWSTAGC]-[LIMSTAG]-[LIVMSTAGC]-x(2)-[DN]-x(2)-[LIVMWSTAC]-x-
[LIVMFSTAG]-W-[DEN]-[LIVMFSTAGCN].

r) wnt Family of Developmental Signaling Proteins. SEQ ID NO:23, 291, 324, 330,
341, and 353 correspond to novel members of the wnt family of developmental signaling
proteins. Wnt-1 (previously known as int-1), the seminal member of this family (Nusse R.,
Trends Genet. (1988) 4:291), is a proto-oncogene induced by the integration of the mouse
mammary tumor virus. It is thought to play a role in intercellular communication and seems
to be a signalling molecule important in the development of the central nervous system
(CNS). The sequence of wnt-1 is highly conserved in mammals, fish, and amphibians. Wnt-1
was found to be a member of a large family of related proteins (Nusse R., et al., Cell
(1992) 69:1073; McMahon A.P., Trends Genet. (1992) 8:1; Moon R.T., BioEssays (1993)
15:91) that are all thought to be developmental regulators. These proteins are known as wnt-2
(also known as irp), wnt-3, -3A, -4, -5A, -5B, -6, -7A, -7B, -8, -8B, -9 and -10. At least
four members of this family are present in Drosophila; one of them, wingless (wg), is
implicated in segmentation polarity. All these proteins share the following features
characteristic of secretory proteins: a signal peptide, several potential N-glycosylation sites
and 22 conserved cysteines that are probably involved in disulfide bonds. The Wnt proteins
seem to adhere to the plasma membrane of the secreting cells and are therefore likely to
signal over only a few cell diameters. The consensus pattern, which is based upon a highly
conserved region including three cysteines, is as follows: C-K-C-H-G-[LIVMT]-S-G-x-C.
All sequences known to belong to this family are detected by the provided consensus pattern.

s) WW/rsp5/WWP Domain-Containing Proteins. SEQ ID NOS:188, 379, and 395
represent polynucleotides encoding a polypeptide in the family of WW/rsp5/WWP
domain-containing proteins. The WW domain (Bork et al., Trends Biochem. Sci. (1994) 19:531;
Andre et al., Biochem. Biophys. Res. Commun. (1994) 205:1201; Hofmann et al., FEBS Lett.
(1995) 358:153; and Sudol et al., FEBS Lett. (1995) 369:67), also known as rsp5 or WWP,
was originally discovered as a short conserved region in a number of unrelated proteins,
among them dystrophin, the gene responsible for Duchenne muscular dystrophy. The
domain, which spans about 35 residues, is repeated up to 4 times in some proteins. It has
been shown (Chen et al., Proc. Natl. Acad. Sci. USA (1995) 92:7819) to bind proteins with
particular proline-motifs, [AP]-P-P-[AP]-Y, and thus somewhat resembles SH3 domains. It
appears to contain beta-strands grouped around four conserved aromatic positions, generally
Trp. The name WW or WWP derives from the presence of these Trp as well as that of a
conserved Pro. It is frequently associated with other domains typical for proteins in signal
transduction processes.

Proteins containing the WW domain include: 

1. Dystrophin, a multidomain cytoskeletal protein. Its longest alternatively
spliced form consists of an N-terminal actin-binding domain, followed by 24 spectrin-like
repeats, a cysteine-rich calcium-binding domain and a C-terminal globular domain.
Dystrophin forms tetramers and is thought to have multiple functions including involvement
in membrane stability, transduction of contractile forces to the extracellular environment and
organization of membrane specialization. Mutations in the dystrophin gene lead to muscular
dystrophy of Duchenne or Becker type. Dystrophin contains one WW domain C-terminal of
the spectrin-repeats.

2. Vertebrate YAP protein, which is a substrate of an unknown serine kinase. It
binds to the SH3 domain of the Yes oncoprotein via a proline-rich region. This protein
appears in alternatively spliced isoforms, containing either one or two WW domains.

3. IQGAP, which is a human GTPase activating protein acting on ras. It
contains an N-terminal domain similar to fly muscle mp20 protein and a C-terminal ras
GTPase activator domain.

For the sensitive detection of WW domains, the profile spans the whole homology
region as well as a pattern. The consensus for this family is: W-x(9,11)-[VFY]-[FYW]-
x(6,7)-[GSTNE]-[GSTQCR]-[FYW]-x(2)-P.

t) Zinc Finger. C2H2 Type. SEQ ID NO:61, 306, and 386 correspond to
polynucleotides encoding novel members of the C2H2 type zinc finger protein family.
Zinc finger domains (Klug et al., Trends Biochem. Sci. (1987) 72:464; Evans et al., Cell
(1988) 52:1; Payre et al., FEBS Lett. (1988) 234:245; Miller et al., EMBO J. (1985) 4:1609;
and Berg, Proc. Natl. Acad. Sci. USA (1988) 55:99) are nucleic acid-binding protein
structures first identified in the Xenopus transcription factor TFIIIA. These domains have
since been found in numerous nucleic acid-binding proteins. A zinc finger domain is
composed of 25 to 30 amino acid residues. Two cysteine or histidine residues are positioned
at both extremities of the domain, which are involved in the tetrahedral coordination of a
zinc atom. It has been proposed that such a domain interacts with about five nucleotides.

Many classes of zinc fingers are characterized according to the number and positions 
of the histidine and cysteine residues involved in the zinc atom coordination. In the first 
class to be characterized, called C2H2, the first pair of zinc coordinating residues are 
cysteines, while the second pair are histidines. A number of experimental reports have
demonstrated the zinc-dependent DNA or RNA binding property of some members of this 
class. 

Mammalian proteins having a C2H2 zinc finger include (number in parenthesis indicates
number of zinc finger regions in the protein): basonuclin (6), BCL-6/LAZ-3 (6), erythroid
krueppel-like transcription factor (3), transcription factors Sp1 (3), Sp2 (3), Sp3 (3) and
Sp4 (3), transcriptional repressor YY1 (4), Wilms' tumor protein (4), EGR1/Krox24 (3),
EGR2/Krox20 (3), EGR3/Pilot (3), EGR4/AT133 (4), Evi-1 (10), GLI1 (5), GLI2 (4+),
GLI3 (3+), HIV-EP1/ZNF40 (4), HIV-EP2 (2), KR1 (9+), KR2 (9), KR3 (15+), KR4 (14+),
KR5 (11+), HF.12 (6+), REX-1 (4), ZfX (13), ZfY (13), Zfp-35 (18), ZNF7 (15), ZNF8 (7),
ZNF35 (10), ZNF42/MZF-1 (13), ZNF43 (22), ZNF46/Kup (2), ZNF76 (7), ZNF91 (36), and
ZNF133 (3).

In addition to the conserved zinc ligand residues, it has been shown that a number of
other positions are also important for the structural integrity of the C2H2 zinc fingers
(Rosenfeld et al., J. Biomol. Struct. Dyn. (1993) 11:557). The best conserved position is
found four residues after the second cysteine; it is generally an aromatic or aliphatic residue.

The consensus pattern for C2H2 zinc fingers is: C-x(2,4)-C-x(3)-[LIVMFYWC]-
x(8)-H-x(3,5)-H. The two C's and two H's are zinc ligands.
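
Because C2H2 proteins are characterized above by the number of zinc finger regions they contain, a rough finger count can be obtained by counting non-overlapping matches to the consensus pattern. The Python sketch below is illustrative only: the regular expression is a hand translation of the C2H2 pattern given above, and the function name and the toy sequence are hypothetical. Divergent fingers that do not fit the pattern exactly would be missed, so the count is a lower-bound estimate under these assumptions.

```python
import re

# Hand translation of C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H.
C2H2_REGEX = re.compile(r"C.{2,4}C.{3}[LIVMFYWC].{8}H.{3,5}H")


def count_c2h2_fingers(protein: str) -> int:
    """Count non-overlapping matches to the C2H2 consensus pattern."""
    return sum(1 for _ in C2H2_REGEX.finditer(protein.upper()))


# Toy sequence (not a real protein): two idealized fingers joined by a linker.
finger = "C" + "AA" + "C" + "AAA" + "F" + "A" * 8 + "H" + "AAA" + "H"
toy_protein = "MS" + finger + "TGEKP" + finger + "KK"

print(count_c2h2_fingers(toy_protein))
```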

u) Zinc Finger. CCHC Class. SEQ ID NO:322 corresponds to a polynucleotide 
encoding a novel member of the zinc finger CCHC family. The CCHC zinc finger protein 
family to date has been mostly composed of retroviral gag proteins (nucleocapsid). The 
prototype structure of this family is from HIV. The family also contains members involved 
in eukaryotic gene regulation, such as C. elegans GLH-1. The consensus sequence of this 
family is based upon the common structure of an 18-residue zinc finger.

v) Zinc-Binding Metalloprotease Domain. SEQ ID NO:306 and 395 represent
polynucleotides encoding novel members of the zinc-binding metalloprotease domain
protein family. The majority of zinc-dependent metallopeptidases (with the notable
exception of the carboxypeptidases) share a common pattern of primary structure (Jongeneel
et al., FEBS Lett. (1989) 242:211; Murphy et al., FEBS Lett. (1991) 289:4; and Bode et al.,
Zoology (1996) 99:237) in the part of their sequence involved in the binding of zinc, and can
be grouped together as a superfamily, known as the metzincins, on the basis of this sequence
similarity. Examples of these proteins include: 1) Angiotensin-converting enzyme (EC
3.4.15.1) (dipeptidyl carboxypeptidase I) (ACE), the enzyme responsible for hydrolyzing
angiotensin I to angiotensin II. 2) Mammalian extracellular matrix metalloproteinases
(known as matrixins) (Woessner, FASEB J. (1991) 5:2145): MMP-1 (EC 3.4.24.7)
(interstitial collagenase), MMP-2 (EC 3.4.24.24) (72 Kd gelatinase), MMP-9 (EC 3.4.24.35)
(92 Kd gelatinase), MMP-7 (EC 3.4.24.23) (matrilysin), MMP-8 (EC 3.4.24.34) (neutrophil
collagenase), MMP-3 (EC 3.4.24.17) (stromelysin-1), MMP-10 (EC 3.4.24.22)
(stromelysin-2), MMP-11 (stromelysin-3), and MMP-12 (EC 3.4.24.65) (macrophage
metalloelastase). 3) Endothelin-converting enzyme 1 (EC 3.4.24.71) (ECE-1), which
processes the precursor of endothelin to release the active peptide.

A signature pattern which includes the two histidine and the glutamic acid residues is
sufficient to detect this superfamily of proteins, having the consensus pattern: [GSTALIVN]-
x(2)-H-E-[LIVMFYW]-{DEHRKP}-H-x-[LIVMFWGS...]. The two H's are zinc
ligands, and E is the active site residue.

Example 4: Differential Expression of Polynucleotides of the Invention: Description of
Libraries and Detection of Differential Expression

The relative expression levels of the polynucleotides of the invention were assessed in
several libraries prepared from various sources, including cell lines and patient tissue
samples. Table 4 provides a summary of these libraries, including the shortened library
name (used hereafter), the mRNA source used to prepare the cDNA library, the "nickname"
of the library that is used in the tables below (in quotes), and the approximate number of

clones in the library.

Table 4. Description of cDNA Libraries

Library (lib#) | Description | Nickname | Number of Clones in this Clustering
---------------------------------------------------------------------------------------
1  | KM12L4: Human Colon Cell Line, High Metastatic Potential (derived from KM12C) | "High Colon" | 307133
2  | KM12C: Human Colon Cell Line, Low Metastatic Potential | "Low Colon" | 284755
3  | MDA-MB-231: Human Breast Cancer Cell Line, High Metastatic Potential; micro-metastases in lung | "High Breast" | 326937
   | MCF7: Human Breast Cancer Cell, Non Metastatic | "Low Breast" | 318979
   | MV-522: Human Lung Cancer Cell Line, High Metastatic Potential | "High Lung" | 223620
   | UCP-3: Human Lung Cancer Cell Line, Low Metastatic Potential | "Low Lung" | 312503
   | Human microvascular endothelial cells (HMEC) - Untreated; PCR (OligodT) cDNA library | | 41938
13 | Human microvascular endothelial cells (HMEC) - bFGF treated; PCR (OligodT) cDNA library | | 42100
14 | Human microvascular endothelial cells (HMEC) - VEGF treated; PCR (OligodT) cDNA library | | 42825
15 | Normal Colon - UC#2 Patient; PCR (OligodT) cDNA library | "Normal Colon Tumor Tissue" | 34285
16 | Colon Tumor - UC#2 Patient; PCR (OligodT) cDNA library | "Normal Colon Tumor Tissue" | 35625
17 | Liver Metastasis from Colon Tumor of UC#2 Patient; PCR (OligodT) cDNA library | "High Colon Metastasis Tissue" | 36984
18 | Normal Colon - UC#3 Patient; PCR (OligodT) cDNA library | "Normal Colon Tumor Tissue" | 36216
19 | Colon Tumor - UC#3 Patient; PCR (OligodT) cDNA library | "High Colon Tumor Tissue" | 41388
20 | Liver Metastasis from Colon Tumor of UC#3 Patient; PCR (OligodT) cDNA library | "High Colon Metastasis Tissue" | 30956

The KM12L4 and KM12C cell lines are described in Example 1 above. The MDA-MB-231
cell line was originally isolated from pleural effusions (Cailleau, J. Natl. Cancer
Inst. (1974) 53:661), is of high metastatic potential, and forms poorly differentiated
adenocarcinoma grade II in nude mice consistent with breast carcinoma. The MCF7 cell line
was derived from a pleural effusion of a breast adenocarcinoma and is non-metastatic. The
MV-522 cell line is derived from a human lung carcinoma and is of high metastatic
potential. The UCP-3 cell line is a low metastatic human lung carcinoma cell line; the
MV-522 is a high metastatic variant of UCP-3. These cell lines are well-recognized in the art
as models for the study of human breast and lung cancer (see, e.g., Chandrasekaran et al.,
Cancer Res. (1979) 39:870 (MDA-MB-231 and MCF-7); Gastpar et al., J Med Chem (1998)
41:4965 (MDA-MB-231 and MCF-7); Ranson et al., Br J Cancer (1998) 77:1586
(MDA-MB-231 and MCF-7); Kuang et al., Nucleic Acids Res (1998) 26:1116 (MDA-MB-231
and MCF-7); Varki et al., Int J Cancer (1987) 40:46 (UCP-3); Varki et al., Tumour Biol.
(1990) 77:327 (MV-522 and UCP-3); Varki et al., Anticancer Res. (1990) 70:637 (MV-522);
Kelner et al., Anticancer Res (1995) 75:867 (MV-522); and Zhang et al., Anticancer Drugs
(1997) 8:696 (MV-522)). The samples of libraries 15-20 are derived from two different
patients (UC#2 and UC#3).

Each of the libraries is composed of a collection of cDNA clones that in turn are
representative of the mRNAs expressed in the indicated mRNA source. In order to facilitate
the analysis of the millions of sequences in each library, the sequences were assigned to
clusters. The concept of "cluster of clones" is derived from a sorting/grouping of cDNA
clones based on their hybridization pattern to a panel of roughly 300 7-bp oligonucleotide
probes (see Drmanac et al., Genomics (1996) 57(1):29). Random cDNA clones from a
tissue library are hybridized at moderate stringency to 300 7-bp oligonucleotides. Each
oligonucleotide has some measure of specific hybridization to that specific clone. The
combination of 300 of these measures of hybridization for 300 probes equals the
"hybridization signature" for a specific clone. Clones with similar sequence will have
similar hybridization signatures. By developing a sorting/grouping algorithm to analyze
these signatures, groups of clones in a library can be identified and brought together
computationally. These groups of clones are termed "clusters". Depending on the
stringency of the selection in the algorithm (similar to the stringency of hybridization in a
classic library cDNA screening protocol), the "purity" of each cluster can be controlled. For
example, artifacts of clustering may occur in computational clustering just as artifacts can
occur in "wet-lab" screening of a cDNA library with 400 bp cDNA fragments, at even the
highest stringency. The stringency used in the implementation of clustering herein provides
groups of clones that are in general from the same cDNA or closely related cDNAs. Closely
related clones can be a result of different length clones of the same cDNA, closely related
clones from highly related gene families, or splice variants of the same cDNA.
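
The sorting/grouping algorithm itself is not spelled out in this Example. Purely as an illustration of how hybridization signatures could be grouped under an adjustable stringency, the Python sketch below uses a simple greedy scheme: each clone joins the first cluster whose seed signature correlates with its own at or above a threshold, and otherwise starts a new cluster. The greedy scheme, the use of Pearson correlation, the function names, and the four-probe toy signatures are all assumptions made for this sketch; a real signature would have roughly 300 values.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation between two equal-length signatures."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a flat signature carries no pattern to compare
    return cov / sqrt(var_a * var_b)


def cluster_clones(signatures, stringency=0.9):
    """Greedily group clone ids by signature similarity.

    signatures: dict mapping clone id -> list of hybridization scores.
    stringency: minimum correlation with a cluster's seed clone; raising
    it yields purer but more numerous clusters.
    """
    clusters = []  # each cluster is a list of clone ids; element 0 is the seed
    for clone_id, signature in signatures.items():
        for members in clusters:
            if pearson(signature, signatures[members[0]]) >= stringency:
                members.append(clone_id)
                break
        else:
            clusters.append([clone_id])
    return clusters


# Toy signatures with four probes each (hypothetical values).
toy = {
    "clone_a": [1.0, 0.0, 1.0, 0.0],
    "clone_b": [0.9, 0.1, 1.0, 0.0],
    "clone_c": [0.0, 1.0, 0.0, 1.0],
}
print(cluster_clones(toy, stringency=0.9))
```

In this sketch the stringency parameter plays the role described in the text: a lower value merges more distantly related clones into one cluster, while a higher value splits them apart.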

Differential expression for a selected cluster was assessed by first determining the
number of cDNA clones corresponding to the selected cluster in the first library (Clones in
1st), and then determining the number of cDNA clones corresponding to the selected cluster in
the second library (Clones in 2nd). Differential expression of the selected cluster in the first
library relative to the second library is expressed as a "ratio" of percent expression between
the two libraries. In general, the "ratio" is calculated by: 1) calculating the percent
expression of the selected cluster in the first library by dividing the number of clones
corresponding to a selected cluster in the first library by the total number of clones analyzed
from the first library; 2) calculating the percent expression of the selected cluster in the
second library by dividing the number of clones corresponding to a selected cluster in a
second library by the total number of clones analyzed from the second library; 3) dividing
the calculated percent expression from the first library by the calculated percent expression
from the second library. If the "number of clones" corresponding to a selected cluster in a
library is zero, the value is set at 1 to aid in calculation. The formula used in calculating the
ratio takes into account the "depth" of each of the libraries being compared, i.e., the total
number of clones analyzed in each library.
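
The three-step ratio calculation described above can be sketched in Python as follows. The function name and the clone counts in the usage lines are hypothetical illustrations and are not values taken from Table 4; the only rules encoded are those stated in the text, namely normalizing each count by its library depth and replacing a zero clone count with 1.

```python
def expression_ratio(clones_1st, total_1st, clones_2nd, total_2nd):
    """Ratio of percent expression of one cluster between two libraries.

    clones_1st / clones_2nd: clones of the cluster found in each library.
    total_1st / total_2nd: total clones analyzed in each library (depth).
    """
    if total_1st <= 0 or total_2nd <= 0:
        raise ValueError("library depth must be a positive number of clones")
    # A zero count is set to 1 so that the ratio stays defined.
    c1 = max(clones_1st, 1)
    c2 = max(clones_2nd, 1)
    percent_1st = c1 / total_1st          # step 1
    percent_2nd = c2 / total_2nd          # step 2
    return percent_1st / percent_2nd      # step 3


# Hypothetical example: 10 clones in a 100,000-clone library versus
# none in a 200,000-clone library.
ratio = expression_ratio(10, 100_000, 0, 200_000)
print(round(ratio, 2))
```

Because each count is divided by its own library depth before the two are compared, the same raw clone count gives a larger ratio when it comes from the shallower library.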

In general, a polynucleotide is said to be significantly differentially expressed
between two samples when the ratio value is greater than at least about 2, preferably greater
than at least about 3, more preferably greater than at least about 5, where the ratio value is
calculated using the method described above. The significance of differential expression is
determined using a z score test (Zar, Biostatistical Analysis, Prentice-Hall, Inc., USA,
"Differences between Proportions," pp. 296-298 (1974)).

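A z score test for the difference between two proportions can be sketched as below. This is a minimal sketch assuming the common pooled-variance form without a continuity correction; the text does not state which variant of the test was applied, and the function names and example counts are hypothetical.

```python
import math


def two_proportion_z(x1, n1, x2, n2):
    """z score for the difference between proportions x1/n1 and x2/n2."""
    pooled = (x1 + x2) / (n1 + n2)
    variance = pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2)
    if variance == 0.0:
        raise ValueError("z score is undefined when the pooled proportion is 0 or 1")
    return (x1 / n1 - x2 / n2) / math.sqrt(variance)


def two_sided_p(z):
    """Two-sided p-value of a z score under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical example: 31 of 100,000 clones versus 4 of 100,000 clones.
z = two_proportion_z(31, 100_000, 4, 100_000)
print(z, two_sided_p(z))
```
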
Tables 5 to 7 (inserted before the claims) show the number of clones in each of the 
above libraries that were analyzed for differential expression. Examples of differentially 
expressed polynucleotides of particular interest are described in more detail below. 

Example 5: Polynucleotides Differentially Expressed in High Metastatic Potential Breast
Cancer Cells Versus Low Metastatic Breast Cancer Cells
A number of polynucleotide sequences have been identified that are differentially
expressed between cells derived from high metastatic potential breast cancer tissue and low
metastatic breast cancer cells. Expression of these sequences in breast cancer can be
valuable in determining diagnostic, prognostic and/or treatment information. For example,
sequences that are highly expressed in the high metastatic potential cells can be indicative of
increased expression of genes or regulatory sequences involved in the metastatic process. A
patient sample displaying an increased level of one or more of these polynucleotides may
thus warrant more aggressive treatment. In another example, sequences that display higher
expression in the low metastatic potential cells can be associated with genes or regulatory
sequences that inhibit metastasis, and thus the expression of these polynucleotides in a
sample may warrant a more positive prognosis than the gross pathology would suggest.

The differential expression of these polynucleotides can be used as a diagnostic
marker, a prognostic marker, for risk assessment, patient treatment and the like. These
polynucleotide sequences can also be used in combination with other known molecular
and/or biochemical markers.

The following table summarizes identified polynucleotides with differential
expression between high metastatic potential breast cancer cells and low metastatic potential
breast cancer cells.

Table 8. Differentially expressed polynucleotides: High metastatic potential breast
cancer vs. low metastatic breast cancer cells

SEQ ID  Differential Expression                 Cluster  Clones in    Clones in    Ratio
NO.                                             ID       1st Library  2nd Library
9       High Breast > Low Breast (Lib3 > Lib4)  2623     31           4            7.561356
42      High Breast > Low Breast (Lib3 > Lib4)  307      196          75           2.549721
52      High Breast > Low Breast (Lib3 > Lib4)  19       1364         525          2.534854
62      High Breast > Low Breast (Lib3 > Lib4)  2623     31           4            7.561356
65      High Breast > Low Breast (Lib3 > Lib4)  5749     9            0            8.780930
66      High Breast > Low Breast (Lib3 > Lib4)  6455     6            0            5.853953
68      High Breast > Low Breast (Lib3 > Lib4)  6455     6            0            5.853953
114     High Breast > Low Breast (Lib3 > Lib4)  2030     32           4            7.805271
123     High Breast > Low Breast (Lib3 > Lib4)  3389     13           2            6.341782
144     High Breast > Low Breast (Lib3 > Lib4)  4623     12           2            5.853953
172     High Breast > Low Breast (Lib3 > Lib4)  102      278          116          2.338217
178     High Breast > Low Breast (Lib3 > Lib4)  3681     10           1            9.756589
214     High Breast > Low Breast (Lib3 > Lib4)  3900     8            1            7.805271
219     High Breast > Low Breast (Lib3 > Lib4)  3389     13           2            6.341782
223     High Breast > Low Breast (Lib3 > Lib4)  1399     19           7            2.648217
258     High Breast > Low Breast (Lib3 > Lib4)  4837     10           0            9.756589
317     High Breast > Low Breast (Lib3 > Lib4)  1577     25           3            8.130490
379     High Breast > Low Breast (Lib3 > Lib4)  260      27           2            13.17139
4       Low Breast > High Breast (Lib4 > Lib3)  3706     22           4            5.637215
39      Low Breast > High Breast (Lib4 > Lib3)  4016     6            0            6.149690
74      Low Breast > High Breast (Lib4 > Lib3)  6268     18           3            6.149690
81      Low Breast > High Breast (Lib4 > Lib3)  40392    8            1            8.199586
130     Low Breast > High Breast (Lib4 > Lib3)  13183    7            0            7.174638
157     Low Breast > High Breast (Lib4 > Lib3)  5417     9            0            9.224535
162     Low Breast > High Breast (Lib4 > Lib3)  9685     7            0            7.174638
183     Low Breast > High Breast (Lib4 > Lib3)  7337     16           3            5.466391
202     Low Breast > High Breast (Lib4 > Lib3)  6124     9            1            9.224535
298     Low Breast > High Breast (Lib4 > Lib3)  1037     22           4            5.637215
338     Low Breast > High Breast (Lib4 > Lib3)  689      36           17           2.170478
384     Low Breast > High Breast (Lib4 > Lib3)  697      72           30           2.459876
386     Low Breast > High Breast (Lib4 > Lib3)  4568     9            0            9.224535
388     Low Breast > High Breast (Lib4 > Lib3)  5622     13           2            6.662164

Example 6: Polynucleotides Differentially Expressed in High Metastatic Potential Lung
Cancer Cells Versus Low Metastatic Lung Cancer Cells
A number of polynucleotide sequences have been identified that are differentially
expressed between cells derived from high metastatic potential lung cancer tissue and low
metastatic lung cancer cells. Expression of these sequences in lung cancer tissue can be
valuable in determining diagnostic, prognostic and/or treatment information. For example,
sequences that are highly expressed in the high metastatic potential cells can
be indicative of increased expression of genes or regulatory sequences involved in the
metastatic process. A patient sample displaying an increased level of one or more of these
polynucleotides may thus warrant more aggressive treatment. In another example,
sequences that display higher expression in the low metastatic potential cells can be
associated with genes or regulatory sequences that inhibit metastasis, and thus the expression
of these polynucleotides in a sample may warrant a more positive prognosis than the gross
pathology would suggest.

The differential expression of these polynucleotides can be used as a diagnostic
marker, a prognostic marker, for risk assessment, patient treatment and the like. These
polynucleotide sequences can also be used in combination with other known molecular
and/or biochemical markers.

The following table summarizes identified polynucleotides with differential
expression between high metastatic potential lung cancer cells and low metastatic potential
lung cancer cells:

Table 9. Differentially expressed polynucleotides: High metastatic potential lung
cancer vs. low metastatic lung cancer cells

SEQ ID  Differential Expression             Cluster  Clones in    Clones in    Ratio
NO.                                         ID       1st Library  2nd Library
400     High Lung > Low Lung (Lib8 > Lib9)  14929    23           16           2.008868
9       High Lung > Low Lung (Lib8 > Lib9)  2623     6            1            8.384840
34      High Lung > Low Lung (Lib8 > Lib9)  5832     5            0            6.987366
42      High Lung > Low Lung (Lib8 > Lib9)  307      79           27           4.088903
62      High Lung > Low Lung (Lib8 > Lib9)  2623     6            1            8.384840
74      High Lung > Low Lung (Lib8 > Lib9)  6268     5            0            6.987366
106     High Lung > Low Lung (Lib8 > Lib9)  10717    8            0            11.17978
119     High Lung > Low Lung (Lib8 > Lib9)  8        1355         122          15.52111
361     High Lung > Low Lung (Lib8 > Lib9)  1120     5            0            6.987366
369     High Lung > Low Lung (Lib8 > Lib9)  2790     6            0            8.384840
371     High Lung > Low Lung (Lib8 > Lib9)  8847     6            1            8.384840
379     High Lung > Low Lung (Lib8 > Lib9)  260      15           0            20.96210
395     High Lung > Low Lung (Lib8 > Lib9)  13538    9            1            12.57726
135     Low Lung > High Lung (Lib9 > Lib8)  36313    30           1            21.46731
154     Low Lung > High Lung (Lib9 > Lib8)  5345     27           6            3.220097
160     Low Lung > High Lung (Lib9 > Lib8)  4386     21           3            5.009039
260     Low Lung > High Lung (Lib9 > Lib8)  4141     27           4            4.830145
308     Low Lung > High Lung (Lib9 > Lib8)  15855    213          12           12.70149
323     Low Lung > High Lung (Lib9 > Lib8)  5257     25           5            3.577885
349     Low Lung > High Lung (Lib9 > Lib8)  2797     14           1            10.01807
381     Low Lung > High Lung (Lib9 > Lib8)  2428     19           2            6.797982



Example 7: Polynucleotides Differentially Expressed in High Metastatic Potential Colon 
Cancer Cells Versus Low Metastatic Colon Cancer Cells 
A number of polynucleotide sequences have been identified that are differentially
expressed between cells derived from high metastatic potential colon cancer tissue and low
metastatic colon cancer cells. Expression of these sequences in colon cancer tissue can be
valuable in determining diagnostic, prognostic and/or treatment information. For example,
sequences that are highly expressed in the high metastatic potential cells can be indicative of
increased expression of genes or regulatory sequences involved in the metastatic process. A
patient sample displaying an increased level of one or more of these polynucleotides may
thus warrant more aggressive treatment. In another example, sequences that display higher
expression in the low metastatic potential cells can be associated with genes or regulatory
sequences that inhibit metastasis, and thus the expression of these polynucleotides in a
sample may warrant a more positive prognosis than the gross pathology would suggest.

The differential expression of these polynucleotides can be used as a diagnostic
marker, a prognostic marker, for risk assessment, patient treatment and the like. These
polynucleotide sequences can also be used in combination with other known molecular
and/or biochemical markers.

The following table summarizes identified polynucleotides with differential
expression between high metastatic potential colon cancer cells and low metastatic potential
colon cancer cells:

Table 10: Differentially expressed polynucleotides: High metastatic potential colon
cancer vs. low metastatic colon cancer cells

SEQ ID  Differential Expression               Cluster  Clones in    Clones in    Ratio
NO.                                           ID       1st Library  2nd Library
1       High Colon > Low Colon (Lib1 > Lib2)  6660     7            0            6.489973
176     High Colon > Low Colon (Lib1 > Lib2)  3765     19           6            2.935940
241     High Colon > Low Colon (Lib1 > Lib2)  4275     11           2            5.099264
362     High Colon > Low Colon (Lib1 > Lib2)  6420     8            0            7.417112
374     High Colon > Low Colon (Lib1 > Lib2)  6420     8            0            7.417112
39      Low Colon > High Colon (Lib2 > Lib1)  4016     14           5            3.020043
97      Low Colon > High Colon (Lib2 > Lib1)  945      21           9            2.516702
134     Low Colon > High Colon (Lib2 > Lib1)  2464     19           5            4.098630
317     Low Colon > High Colon (Lib2 > Lib1)  1577     40           12           3.595289
357     Low Colon > High Colon (Lib2 > Lib1)  4309     13           4            3.505407

Example 8: Polynucleotides Differentially Expressed at Higher Levels in High Metastatic
Potential Colon Cancer Patient Tissue Versus Normal Patient Tissue
A number of polynucleotide sequences have been identified that are differentially
expressed between cells derived from high metastatic potential colon cancer tissue and
normal tissue. Expression of these sequences in colon cancer tissue can be valuable in
determining diagnostic, prognostic and/or treatment information. For example, sequences
that are highly expressed in the high metastatic potential cells can be
indicative of increased expression of genes or regulatory sequences involved in the advanced
disease state which involves processes such as angiogenesis, dedifferentiation, cell
replication, and metastasis. A patient sample displaying an increased level of one or more of
these polynucleotides may thus warrant more aggressive treatment.

The differential expression of these polynucleotides can be used as a diagnostic
marker, a prognostic marker, for risk assessment, patient treatment and the like. These
polynucleotide sequences can also be used in combination with other known molecular
and/or biochemical markers.

The following table summarizes identified polynucleotides with differential
expression between high metastatic potential colon cancer cells and normal colon cells:



Table 11: Differentially expressed polynucleotides: High metastatic potential colon
tissue vs. normal colon tissue

SEQ ID  Differential Expression                       Cluster  Clones in    Clones in    Ratio
NO.                                                   ID       1st Library  2nd Library
52      High Colon Metastasis Tissue > Normal         19       10           0            11.69918
        Colon Tissue of UC#3 (Lib20 > Lib18)
52      High Colon Metastasis Tissue > Normal         19       13           2            6.025646
        Tissue in UC#2 (Lib17 > Lib15)
172     High Colon Metastasis Tissue > Normal         102      65           22           2.738930
        Tissue in UC#2 (Lib17 > Lib15)



Example 9: Polynucleotides Differentially Expressed at Higher Levels in High Colon 
Tumor Potential Patient Tissue Versus Metastasized Colon Cancer Patient 
Tissue 

A number of polynucleotide sequences have been identified that are differentially 
expressed between cells derived from high tumor potential colon cancer tissue and cells 
derived from high metastatic potential colon cancer cells. Expression of these sequences in 
colon cancer tissue can be valuable in determining diagnostic, prognostic and/or treatment 
information associated with the transformation of precancerous tissue to malignant tissue. 
This information can be useful in the prevention of achieving the advanced malignant state 
in these tissues, and can be important in risk assessment for a patient.

The following table summarizes identified polynucleotides with differential
expression between high tumor potential colon cancer tissue and cells derived from high
metastatic potential colon cancer cells:

Table 12: Differentially expressed polynucleotides: High tumor potential colon tissue
vs. metastatic colon tissue

SEQ ID  Differential Expression                       Cluster  Clones in    Clones in    Ratio
NO.                                                   ID       1st Library  2nd Library
52      High Colon Tumor Tissue > Metastasis          19       69           10           5.160829
        Tissue of UC#3 (Lib19 > Lib20)
119     High Colon Tumor Tissue > Metastasis          8        14           1            10.47124
        Tissue of UC#3 (Lib19 > Lib20)
172     High Colon Tumor Tissue > Metastasis          102      43           10           3.216168
        Tissue of UC#3 (Lib19 > Lib20)



Example 10: Polynucleotides Differentially Expressed at Higher Levels in High Tumor
Potential Colon Cancer Patient Tissue Versus Normal Patient Tissue
A number of polynucleotide sequences have been identified that are differentially
expressed between cells derived from high tumor potential colon cancer tissue and normal
tissue. Expression of these sequences in colon cancer tissue can be valuable in determining
diagnostic, prognostic and/or treatment information associated with the prevention of
achieving the malignant state in these tissues, and can be important in risk assessment for a
patient. For example, sequences that are highly expressed in the potential colon cancer cells
are associated with or can be indicative of increased expression of genes or regulatory
sequences involved in early tumor progression. A patient sample displaying an increased
level of one or more of these polynucleotides may thus warrant closer attention or more
frequent screening procedures to catch the malignant state as early as possible.

The following table summarizes identified polynucleotides with differential
expression between high tumor potential colon cancer tissue and normal colon tissue:

Table 13: Differentially expressed polynucleotides: High tumor potential colon tissue
vs. normal colon tissue

SEQ ID  Differential Expression                       Cluster  Clones in    Clones in    Ratio
NO.                                                   ID       1st Library  2nd Library
52      High Colon Tumor Tissue > Normal Tissue       19       13           2            6.255508
        of UC#2 (Lib16 > Lib15)
288     High Colon Tumor Tissue > Normal Tissue       1267     7            0            6.125253
        of UC#2 (Lib16 > Lib15)
52      High Colon Tumor Tissue > Normal Tissue       19       69           0            60.37750
        of UC#3 (Lib19 > Lib18)
119     High Colon Tumor Tissue > Normal Tissue       8        14           1            12.25050
        of UC#3 (Lib19 > Lib18)
172     High Colon Tumor Tissue > Normal Tissue       102      43           7            5.375222
        of UC#3 (Lib19 > Lib18)
Example 11: Polynucleotides Differentially Expressed Across Multiple Libraries

A number of polynucleotide sequences have been identified that are differentially
expressed between cancerous cells and normal cells across all three tissue types tested (i.e.,
breast, colon, and lung). Expression of these sequences in a tissue of any origin can be
valuable in determining diagnostic, prognostic and/or treatment information associated with
the prevention of achieving the malignant state in these tissues, and can be important in risk
assessment for a patient. These polynucleotides can also serve as non-tissue specific
markers of, for example, risk of metastasis of a tumor. The following table summarizes
identified polynucleotides that were differentially expressed but without tissue type-
specificity in the breast, colon, and lung libraries tested.
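
Selecting polynucleotides that are differentially expressed in more than one tissue type amounts to grouping the per-tissue results by SEQ ID NO. The sketch below is illustrative only: the function name and the (SEQ ID, tissue) input layout are assumptions, and the few example rows are taken from the breast, lung, and colon tables of Examples 5 to 7.

```python
from collections import defaultdict


def shared_across_tissues(rows, min_tissues=2):
    """Return {seq_id: sorted tissue names} for SEQ IDs seen in >= min_tissues tissues.

    rows: iterable of (seq_id, tissue) pairs, one per differential-expression result.
    """
    tissues_by_seq = defaultdict(set)
    for seq_id, tissue in rows:
        tissues_by_seq[seq_id].add(tissue)
    return {
        seq_id: sorted(tissues)
        for seq_id, tissues in tissues_by_seq.items()
        if len(tissues) >= min_tissues
    }


rows = [
    (9, "breast"), (9, "lung"),       # SEQ ID NO:9 appears in both comparisons
    (65, "breast"),                   # SEQ ID NO:65 appears in the breast table only
    (317, "breast"), (317, "colon"),
]
print(shared_across_tissues(rows))
```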



Table 14: Polynucleotides Differentially Expressed Across Multiple Library Comparisons

SEQ ID  Differential Expression                       Cluster  Clones in    Clones in    Ratio
NO.                                                   ID       1st Library  2nd Library
9       High Breast > Low Breast (Lib3 > Lib4)        2623     31           4            7.561356
        High Lung > Low Lung (Lib8 > Lib9)            2623     6            1            8.384840
39      Low Breast > High Breast (Lib4 > Lib3)        4016     6            0            6.149690
        Low Colon > High Colon (Lib2 > Lib1)          4016     14           5            3.020043
42      High Breast > Low Breast (Lib3 > Lib4)        307      196          75           2.549721
        High Lung > Low Lung (Lib8 > Lib9)            307      79           27           4.088903
52      High Breast > Low Breast (Lib3 > Lib4)        19       1364         525          2.534854
        High Colon Metastasis Tissue > Normal         19       10           0            11.69918
        Colon Tissue of UC#3 (Lib20 > Lib18)
        High Colon Metastasis Tissue > Normal         19       13           2            6.025646
        Tissue in UC#2 (Lib17 > Lib15)
        High Colon Tumor Tissue > Metastasis          19       69           10           5.160829
        Tissue of UC#3 (Lib19 > Lib20)
        High Colon Tumor Tissue > Normal Tissue       19       13           2            6.255508
        of UC#2 (Lib16 > Lib15)
        High Colon Tumor Tissue > Normal Tissue       19       69           0            60.37750
        of UC#3 (Lib19 > Lib18)
62      High Breast > Low Breast (Lib3 > Lib4)        2623     31           4            7.561356
        High Lung > Low Lung (Lib8 > Lib9)            2623     6            1            8.384840
74      High Lung > Low Lung (Lib8 > Lib9)            6268     5            0            6.987366
        Low Breast > High Breast (Lib4 > Lib3)        6268     18           3            6.149690
119     High Colon Tumor Tissue > Metastasis          8        14           1            10.47124
        Tissue of UC#3 (Lib19 > Lib20)
        High Colon Tumor Tissue > Normal Tissue       8        14           1            12.25050
        of UC#3 (Lib19 > Lib18)
        High Lung > Low Lung (Lib8 > Lib9)            8        1355         122          15.52111
172     High Breast > Low Breast (Lib3 > Lib4)        102      278          116          2.338217
        High Colon Metastasis Tissue > Normal         102      65           22           2.738930
        Tissue in UC#2 (Lib17 > Lib15)
        High Colon Tumor Tissue > Metastasis          102      43           10           3.216168
        Tissue of UC#3 (Lib19 > Lib20)
        High Colon Tumor Tissue > Normal Tissue       102      43           7            5.375222
        of UC#3 (Lib19 > Lib18)
317     High Breast > Low Breast (Lib3 > Lib4)        1577     25           3            8.130490
        Low Colon > High Colon (Lib2 > Lib1)          1577     40           12           3.595289
379     High Breast > Low Breast (Lib3 > Lib4)        260      27           2            13.17139
        High Lung > Low Lung (Lib8 > Lib9)            260      15           0            20.96210



Example 12: Polynucleotides Exhibiting Colon-Specific Expression

The cDNA libraries described herein were also analyzed to identify those 
polynucleotides that were specifically expressed in colon cells or tissue, i.e., the 
polynucleotides were identified in libraries prepared from colon cell lines or tissue, but not 
in libraries of breast or lung origin. The polynucleotides that were expressed in a colon cell 
line and/or in colon tissue, but were not present in the breast or lung cDNA libraries described
herein, are shown in Table 15. 
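
The colon-specificity criterion (present in at least one colon library, absent from every breast and lung library) can be sketched as a simple filter. This is an illustrative sketch only: the function name, the nested-dictionary layout, and the cluster names and counts in the example are hypothetical; the library labels follow the numbering used in the tables herein.

```python
def colon_specific(clone_counts, colon_libs, other_libs):
    """Return sorted cluster IDs found in a colon library and in no other library.

    clone_counts: {cluster_id: {library_name: number_of_clones}}
    """
    specific = []
    for cluster, counts in clone_counts.items():
        in_colon = any(counts.get(lib, 0) > 0 for lib in colon_libs)
        in_other = any(counts.get(lib, 0) > 0 for lib in other_libs)
        if in_colon and not in_other:
            specific.append(cluster)
    return sorted(specific)


# Hypothetical clone counts for three clusters.
counts = {
    "cluster_A": {"Lib1": 2, "Lib2": 0},   # colon only
    "cluster_B": {"Lib1": 3, "Lib3": 1},   # also seen in a breast library
    "cluster_C": {"Lib8": 4},              # lung only
}
print(colon_specific(counts, ["Lib1", "Lib2"], ["Lib3", "Lib4", "Lib8", "Lib9"]))
```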



Table 15. Polynucleotides specifically expressed in colon cells.

SEQ ID  Cluster  Clones in    Clones in
NO.              1st Library  2nd Library
5       36535    2            0
13      27250    2            0
19      16283    3            0
24      16918    4            0
26      40108    2            0
32      32663    1            1
43      39833    2            0
47      18957    3            0
48      39508    2            0
56      7005     8            2
58      18957    3            0
59      18957    3            0
60      16283    3            0
64      13238    4            1
70      39442    2            0
71      17036    4            0
73      7005     8            2
83      11476    6            0
86      39425    2            0
94      21847    2            1
100     16731    3            1
101     12439    4            0
113     17055    4            0
120     67907    1            0
121     12081    4            0
124     39174    2            0
229     39648    2            0
231     85064    1            0
234     39391    2            0
236     39498    2            0
242     22113    3            0
247     19255    2            0
252     22814    3            0
253     39563    2            0
254     39420    2            0
257     39412    2            0
261     38085    2            0
265     40054    1            0
266     39423    2            0
267     39453    2            0
270     78091    1            0
276     39168    2            0
277     39458    2            0
278     14391    3            1
279     39195    2            0
282     12977    5            0
284     14391    3            1
290     16347    4            0
293     39478    2            0
294     39392    2            0
297     39180    2            0
299     6867     7            3
SEQ ID  Cluster  Clones in    Clones in    SEQ ID  Cluster  Clones in    Clones in
NO.              1st Library  2nd Library  NO.              1st Library  2nd Library



1Z6 


fl7 f A 
OZIO 


Z 


o 


1 A t 

301 


41633 


1 


1 


1 711 
1Z5 


41J4 J 3 


z 


u 


JOZ 


23218 


3 


0 


1 70 


77 IO^ 




n 
u 


303 


1 £\*J OA 

39380 


2 


A 

0 


1 At 


OOOJ7 


I 
1 


A 

u 


1 An 
30V 


o43ZO 


t 
1 


A 
0 


1 <A 
1 jU 


HA77 

oo /z 


*T 




314 


1436/ 


3 


A 
0 




1 AQ77 


A 
*t 


A 


77A 

- 3Z0 


1 0ff 9 A 


Z 


A 

• 0 


t JO 


1 /vJO 


H 


A 


3Z4 


OAA1 

vooi 


5 




1 <Q 


*rU\rH 


7 
z 


A 
U 


777 
3Z / 


1 AA<1 
10O53 


3 


■ 

I 


lOl 


*tuvw*t 


9 

z 


A 


77ff 
JZo 


1 AQSK 


4 


A 

0 


1 A7 
I OJ 


77 1 <^ 


j 


* A 


770 * 
3ZV 


1 7077 

izy / / 


< 
5 


A 


too 


i ^A£/; 

1 JUOO 


H 


A 


71A 
33U 


OAAI 

yooi 


c 


Z 


1 7 A 
1 /U 




-J 


A 


jjj 


lojyz 


1 
j 


A 
U 


1 7 A 
1 /O 


J too 


I O 

i y 


o 


7,47 
34Z 


3y4oo 


Z 


A 
0 


1 C 1 


HAi IA 

OO 1 IV 


i 


A 
U 


1AA 
344 


05/4 


6 


3 


1 C7 

loz 


jyo4o 


z 


A 

U 


34} 


oo /4 


6 


3 


loO 


1 7A7A 


*f 


A 

u 


353 


i I4y4 


4 


A 

0: 


loo 


7770M 
ZZ/y4 


Z 


A 


■a CA 

354 


17062 


3 


0 


1 1*7 
15/ 


701 71 

jyi / 1 


Z 


A 
V 


7<< 

355 


16245 


4 


A • 

0 


1V4 


Af\A« 
40455 


Z 


A 
0 


356 


83103 


1 


0 


1 Art 


16317 




A 
0 


1 CO 

358 


13072 


4 


f 

I 


210 


39 loo 


z 


A 
O 


366     14364    1            0
211     40122    2            0
368     84182    1            0
218     26295    2            0
372     56020    1            0
222     4665     5            9
389     7514     5            3
226     82498    1            0
391     7570     5            3
227     35702    2            0
393     23210    3            0



In addition to the above, SEQ ID NOS:159 and 161 were each present in one clone in
each of Lib16 (Normal Colon Tumor Tissue), and SEQ ID NOS:344 and 345 were each
present in one clone in Lib17 (High Colon Metastasis Tissue). No clones corresponding to
the colon-specific polynucleotides in the table above were present in any of Libraries 3, 4, 8,
or 9. The polynucleotides provided above can be used as markers of cells of colon origin, and
find particular use in reference arrays, as described above.

Example 13: Identification of Contiguous Sequences Having a Polynucleotide of the 
Invention 

The novel polynucleotides were used to screen publicly available and proprietary
databases to determine if any of the polynucleotides of SEQ ID NOS:1-404 would facilitate
identification of a contiguous sequence, e.g., the polynucleotides would provide sequence
that would result in 5' extension of another DNA sequence, resulting in production of a
longer contiguous sequence composed of the provided polynucleotide and the other DNA
sequence(s). Contiging was performed using the AssemblyLign program with the following
parameters: 1) Overlap: Minimum Overlap Length: 30; % Stringency: 50; Minimum
Repeat Length: 30; Alignment: gap creation penalty: 1.00, gap extension penalty: 1.00; 2)
Consensus: % Base designation threshold: 80.
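
How a minimum overlap length and a percent stringency jointly gate the joining of two sequences can be illustrated with the sketch below. It is not the AssemblyLign algorithm: it assumes a simple ungapped comparison of the end of one sequence with the start of the other, ignores the repeat, gap penalty, and consensus parameters, and uses hypothetical function names and toy sequences.

```python
def best_overlap(left, right, min_overlap=30, stringency=50.0):
    """Length of the longest ungapped overlap between the end of `left` and the
    start of `right` that is at least `min_overlap` long and at least
    `stringency` percent identical; 0 if no such overlap exists."""
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    for length in range(min(len(left), len(right)), min_overlap - 1, -1):
        suffix, prefix = left[-length:], right[:length]
        matches = sum(a == b for a, b in zip(suffix, prefix))
        if 100.0 * matches / length >= stringency:
            return length
    return 0


def merge(left, right, overlap):
    """Join two sequences, keeping the overlapping region once (taken from `left`)."""
    return left + right[overlap:]


# Toy sequences sharing an exact 33-base overlap.
shared = "CGTACGTTAGC" * 3
left = "A" * 20 + shared
right = shared + "G" * 20
n = best_overlap(left, right, min_overlap=30, stringency=100.0)
print(n, len(merge(left, right, n)))
```

Lowering either threshold, as was done for the sequences that did not contig under the default settings, allows shorter or less similar overlaps to be accepted.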

Using these parameters, 44 polynucleotides provided contiged sequences. These
contiged sequences are provided as SEQ ID NOS:801-844. The contiged sequences can be
correlated with the sequences of SEQ ID NOS:1-404 upon which the contiged sequences are
based by identifying those sequences of SEQ ID NOS:1-404 and the contiged sequences of
SEQ ID NOS:801-844 that share the same clone name in Table 1. It should be noted that of
these 44 sequences that provided a contiged sequence, the following members of that group
of 44 did not contig using the overlap settings indicated in parentheses (Stringency/Overlap):
SEQ ID NO:804 (30%/10); SEQ ID NO:810 (20%/20); SEQ ID NO:812 (30%/10); SEQ ID
NO:814 (40%/20); SEQ ID NO:816 (30%/10); SEQ ID NO:832 (30%/10); SEQ ID NO:840
(20%/20); SEQ ID NO:841 (40%/20). To generalize, the indicated polynucleotides did not
contig using a minimum 20% stringency, 10 overlap. There was a corresponding increase in
the number of degenerate codons in these sequences.

The contiged sequences (SEQ ID NOS:801-844) thus represent longer sequences that
encompass a polynucleotide sequence of the invention. The contiged sequences were then
translated in all three reading frames to determine the best alignment with individual
sequences using the BLAST programs as described above for SEQ ID NOS:1-404 and the
validation sequences SEQ ID NOS:405-800. Again the sequences were masked using the
XBLAST program for masking low complexity as described above in Example 1 (Table 2).
Several of the contiged sequences were found to encode polypeptides having characteristics
of a polypeptide belonging to a known protein family (and thus represent new members of
these protein families) and/or comprising a known functional domain (Table 16). Thus the
invention encompasses fragments, fusions, and variants of such polynucleotides that retain
biological activity associated with the protein family and/or functional domain identified
herein.
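
Translation of a nucleotide sequence in its three forward reading frames can be sketched as follows. This is a minimal illustration assuming the standard genetic code; the function names are hypothetical, and the BLAST alignment and XBLAST masking steps are not reproduced.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq, frame):
    """Translate `seq` starting at offset `frame` (0, 1, or 2).

    Stop codons are written as '*'; codons containing ambiguous bases as 'X'.
    """
    seq = seq.upper()
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def three_frame_translation(seq):
    """Return the translations of the three forward reading frames."""
    return [translate(seq, frame) for frame in range(3)]


print(three_frame_translation("ATGGCCTAA"))
```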
Table 16. Profile hits using contiged sequences

SEQ ID  Sequence Name                                Profile            Start
NO.                                                                     (Stop)
809     Contig_RTA00000177AF.n.18.3.Seq_THC123051    ATPases            778 (1612)    6040
824     Contig_RTA00000187AF.g.24.1.Seq_THC168636    homeobox           531 (707)     12080
824     Contig_RTA00000187AF.g.24.1.Seq_THC168636    MAP kinase kinase  769 (1494)    5784
833     Contig_RTA00000190AF.j.4.1.Seq_THC228776     protein kinase     170 (1010)    5027
833     Contig_RTA00000190AF.j.4.1.Seq_THC228776     protein kinase     170 (1010)    5027

All stop/start sequences are provided in the forward direction.



The profiles for the ATPases (AAA) and protein kinase families are described above
in Example 2. The homeobox and MAP kinase kinase protein families are described further
below.

Homeobox domain. The 'homeobox' is a protein domain of 60 amino acids (Gehring
In: Guidebook to the Homeobox Genes, Duboule D., Ed., pp1-10, Oxford University Press,
Oxford, (1994); Buerglin In: Guidebook to the Homeobox Genes, pp25-72, Oxford
University Press, Oxford, (1994); Gehring Trends Biochem. Sci. (1992) 17:277-280; Gehring
et al. Annu. Rev. Genet. (1986) 20:147-173; Schofield Trends Neurosci. (1987) 10:3-6;
http://copan.bioz.unibas.ch/homeo.html) first identified in a number of Drosophila homeotic
and segmentation proteins. It is extremely well conserved in many other animals, including
vertebrates. This domain binds DNA through a helix-turn-helix type of structure. Several
proteins that contain a homeobox domain play an important role in development. Most of
these proteins are sequence-specific DNA-binding transcription factors. The homeobox
domain is also very similar to a region of the yeast mating type proteins. These are
sequence-specific DNA-binding proteins that act as master switches in yeast differentiation
by controlling gene expression in a cell type-specific fashion.

A schematic representation of the homeobox domain is shown below. The helix-
turn-helix region is shown by the symbols 'H' (for helix), and 't' (for turn).

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxHHHHHHHHtttHra 
5 1 60 
The pattern detects homeobox sequences 24 residues long and spans positions 34 to 57 of the
homeobox domain. The consensus pattern is as follows: [LIVMFYG]-[ASLVR]-x(2)-
[LIVMSTACN]-x-[LIVM]-x(4)-[LIV]-[RKNQESTAIY]-[LIVFSTNKH]-W-[FYVC]-x-
[NDQTAH]-x(5)-[RKNAIMW].
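
A consensus pattern written in this bracket-and-dash notation can be scanned against a protein sequence by converting it to a regular expression, as sketched below. The converter is a minimal illustration that handles only the pattern elements named in its docstring. The pattern string is the consensus pattern quoted above, with its x(4) spacer and its W-[FYVC]-x segment taken on the assumption that it follows the PROSITE homeobox signature; the function name and the 60-residue test sequence (a homeodomain-like sequence used purely for illustration) are likewise assumptions.

```python
import re

ELEMENT = re.compile(
    r"(<?)(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)"
)


def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern to a Python regular expression.

    Handles: single residues, x (any residue), [..] allowed sets,
    {..} excluded sets, (n) and (n,m) repeats, and the < and > end anchors.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        start, core, low, high, end = m.groups()
        if core == "x":
            core = "."
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"
        repeat = ""
        if low:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if start else "") + core + repeat + ("$" if end else ""))
    return "".join(parts)


HOMEOBOX = (
    "[LIVMFYG]-[ASLVR]-x(2)-[LIVMSTACN]-x-[LIVM]-x(4)-[LIV]-[RKNQESTAIY]-"
    "[LIVFSTNKH]-W-[FYVC]-x-[NDQTAH]-x(5)-[RKNAIMW]."
)
homeobox_regex = re.compile(prosite_to_regex(HOMEOBOX))

# Illustrative 60-residue homeodomain-like sequence.
sequence = "RKRGRQTYTRYQTLELEKEFHFNRYLTRRRRIEIAHALCLTERQIKIWFQNRRMKWKKEN"
hit = homeobox_regex.search(sequence)
if hit:
    # Report 1-based start and end positions of the match.
    print(hit.start() + 1, hit.end())
```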

MAP kinase kinase (MAPKK). MAP kinases (MAPK) are involved in signal
transduction, and are important in cell cycle and cell growth controls. The MAP kinase
kinases (MAPKK) are dual-specificity protein kinases which phosphorylate and activate
MAP kinases. MAPKK homologues have been found in yeast, invertebrates, amphibians,
and mammals. Moreover, the MAPKK/MAPK phosphorylation switch constitutes a basic
module activated in distinct pathways in yeast and in vertebrates. MAPKK regulation
studies have led to the discovery of at least four MAPKK convergent pathways in higher
organisms. One of these is similar to the yeast pheromone response pathway which includes
the ste11 protein kinase. Two other pathways require the activation of either one or both of
the serine/threonine kinase-encoded oncogenes c-Raf-1 and c-Mos. Additionally, several
studies suggest a possible effect of the cell cycle control regulator cyclin-dependent kinase 1
(cdc2) on MAPKK activity. Finally, MAPKKs are apparently essential transducers through
which signals must pass before reaching the nucleus. For review, see, e.g., Biologique Biol
Cell (1993) 79:193-207; Nishida et al., Trends Biochem Sci (1993) 18:128-31; Ruderman
Curr Opin Cell Biol (1993) 5:207-13; Dhanasekaran et al., Oncogene (1998) 17:1447-55;
Kiefer et al., Biochem Soc Trans (1997) 25:491-8; and Hill, Cell Signal (1996) 8:533-44.

Those skilled in the art will recognize, or be able to ascertain, using not more than 
routine experimentation, many equivalents to the specific embodiments of the invention 
described herein. Such specific embodiments and equivalents are intended to be 
encompassed by the following claims. 

All publications and patent applications cited in this specification are herein 
incorporated by reference as if each individual publication or patent application were 
specifically and individually indicated to be incorporated by reference. The citation of any 
publication is for its disclosure prior to the filing date and should not be construed as an 
admission that the present invention is not entitled to antedate such publication by virtue of 
prior invention.

Although the foregoing invention has been described in some detail by way of
illustration and example for purposes of clarity of understanding, it is readily apparent to
those of ordinary skill in the art in light of the teachings of this invention that certain changes
and modifications may be made thereto without departing from the spirit or scope of the
appended claims.



Deposit Information:

The following materials were deposited with the American Type Culture Collection
(CMCC = Chiron Master Culture Collection):

Cell Lines Deposited with ATCC

Cell Line      Deposit Date       ATCC Accession No.   CMCC Accession No.
KM12L4-A       March 19, 1998     CRL-12496            11606
KM12C          May 15, 1998       CRL-12533            11611
MDA-MB-231     May 15, 1998       CRL-12532            10583
MCF-7          October 9, 1998    CRL-12584            10377
Clone Name
M000041I1D:A08 
M00004121B:GOI 
M00004121B:G01 
M00004121B.G0I 
M00004138B:H02 
M00004138B:H02 
M00004151D:B08 
M00004I69C.C12 
M00004169C:C12 
M06004169C:G*2 
M00004183C:D07 
M00004183C:D07 
M00004230B.C07 
M00004230B:C07 
M00004249D:F10 
M00004249D:F10 
M00004275C:C1 1 
M00004275C:C1 I 
M00004283B:A04 
M0000428SB:E08 
M00004327B:H04 
M00004377C.F05 
M00004384C:D02 
M00004384C:D02 
M00004461A.B08 
M00004461A:B09 
M00004691D:A05 
M00004896A.-C07 



Cluster ID 
6874 



13272 

13272 

16977 

5319 

5319 

5319 

16392 

16392 

7212 

7212 



16914 
16914 
14286 
56020 

2102 



Sequence Name 

99.F5.sp6:131294.Seq 

I77.H4.sp6:134791.Seq 

99.H5.sp6:I3I318.Seq 

RTAOOOOO 1 92AF.C.2.1 

99.A6.sp6:131235.Seq 

RTAOOOOO 192AF.e.3.1 

RTAOOOOO 1 92AF.g.3. 1 

99.E6.sp6: 131 283 .Seq 

RTA00000 1 92AF.i. 1 2. 1 

123.F7.sp6: 13233 l.Seq 

RTA00000192AF.I.1.1 

RTA00000 1 92AF.1. 1 . 1 .Seq_THC20207 1 

RTA00000 1 93 AF.b. 1 4. 1 

99.D8.sp6:131273.Seq 

RTAOOOOO 1 93 AF.c.2 1 . 1 .Seq_THC222602 

RTA00000193AF.C.21.1 

99.A9.sp6:I31238.Seq 

RTAOOOOO 1 93 AF.f.5.1 

RTAOOOOO 1 93 AF.f.22. 1 

RTAOOOOO 1 93 AF.g.2. 1 

RTA00000193AF.j.20.1 

RTAOOOOO 1 93 AF ji.7. 1 

RTAOOOOO 1 93 AF.n. 15.1 

RTAOOOOO 1 93 AF.n. 15.1 .Seq_THC2 1 5687 

RTA00000194ARU1.10.2 

RTAOOOOO 1 94AF.a. 11.1 

RTAOOOOO 1 94AF.C.23. 1 

RTAOOOOO 1 94AF.d. 13.1 
The above material has been deposited with the American Type Culture Collection,
Rockville, Maryland, under the accession number indicated. This deposit will be maintained 
under the terms of the Budapest Treaty on the International Recognition of the Deposit of 
Microorganisms for purposes of Patent Procedure. The deposit will be maintained for a 
period of 30 years following issuance of this patent, or for the enforceable life of the patent, 
whichever is greater. Upon issuance of the patent, the deposit will be available to the public 
from the ATCC without restriction. 

This deposit is provided merely as a convenience to those of skill in the art, and is not
an admission that a deposit is required under 35 U.S.C. §112. The sequence of the 
polynucleotides contained within the deposited material, as well as the amino acid sequence 
of the polypeptides encoded thereby, are incorporated herein by reference and are controlling 
in the event of any conflict with the written description of sequences herein. A license may 
be required to make, use, or sell the deposited material, and no such license is granted
hereby.

Retrieval of Individual Clones from Deposit of Pooled Clones 
Where the ATCC deposit is composed of a pool of cDNA clones, the deposit was
prepared by first transfecting each of the clones into separate bacterial cells. The clones
were then deposited as a pool of equal mixtures in the composite deposit. Particular clones
can be obtained from the composite deposit using methods well known in the art. For
example, a bacterial cell containing a particular clone can be identified by isolating single
colonies, and identifying colonies containing the specific clone through standard colony
hybridization techniques, using an oligonucleotide probe or probes designed to specifically
hybridize to a sequence of the clone insert (e.g., a probe based upon unmasked sequence of
the encoded polynucleotide having the indicated SEQ ID NO). The probe should be
designed to have a Tm of approximately 80°C (assuming 2°C for each A or T and 4°C for
each G or C). Positive colonies can then be picked, grown in culture, and the recombinant
clone isolated. Alternatively, probes designed in this manner can be used in PCR to isolate a
nucleic acid molecule from the pooled clones according to methods well known in the art,
e.g., by purifying the cDNA from the deposited culture pool, and using the probes in PCR
reactions to produce an amplified product having the corresponding desired polynucleotide
sequence.
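
The melting-temperature estimate described above (2°C for each A or T and 4°C for each G or C) can be sketched in a few lines. The function name `estimate_tm` is hypothetical and is not part of the original disclosure; the sketch assumes an unambiguous DNA probe and applies only the additive rule stated in the text.

```python
def estimate_tm(probe: str) -> int:
    """Estimate probe Tm in degrees C: 2 per A or T, 4 per G or C."""
    probe = probe.upper()
    at_count = probe.count("A") + probe.count("T")
    gc_count = probe.count("G") + probe.count("C")
    if at_count + gc_count != len(probe):
        # Ambiguous bases such as 'n' cannot be scored by this rule.
        raise ValueError("probe must contain only A, C, G and T")
    return 2 * at_count + 4 * gc_count
```

Under this rule a probe reaches the suggested Tm of about 80°C at a length that depends on its base composition, for example 20 bases if it consists entirely of G and C, or 40 bases if it consists entirely of A and T.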
Table 1. Sequence identification numbers, cluster ID, sequence name,
SEQ ID NO: Cluster ID Sequence Name 

RTAOOOOO 1 80AF.i.20. 1 

RTA0O0O0 1 85 AF .n. 12.1 

RTAOOOOO 1 87AF.m. 1 5.2 

RTA00000191AF.U7.2 

RTA00000181AF.f.5.1 

RTAOOOOO 1 83 AF.j. 1 1.1 

RTAOOOOO 192AF.i. 12.1 

RTAOOOOO 180AF.C.2.1 

RTAOOOOO 1 83 AF.a.6. 1 

RTAOOOOO 1 78AF ji.24. 1 

RTAOOOOO 1 37A.g.6. 1 

RTAOOOOO 187AF.1.7.1 

RTA000OO181AF.g.lO.l 

RTAOOOOO 1 79AF.n. 1 0. 1 

RTAOOOOO 1 92AF jn. 12. 1 

RTAOOOOO 1 84AF.k. 12. 1 

RTAOOOOO 1 89AF.g. 1 . 1 

RTAOOOOO 1 87AF.g. 12.1 

RTAOOOOO 1 20A.O.20. 1 

RTA00000191AFjl9.1 

RTAOOOOO 1 84AF.j J2 1 . 1 

RTAOOOOO 1 82AF.U0. 1 

RTA00OOO123A.g.l9.1 

RTAOOOOO 1 93 AF 1 6. 1 

RTA000OO193AF.f.5.1 

RTAOOOOOl 87AF.0.24.1 

RTAOOOOO 1 93 AF.f.22. 1 

RTAOOOOO 1 86AF.b.2 1 . 1 

RTAOOOOO 1 80AF.g_22. 1 

RTA0000Ol92AF.eJ.l 

RTAOOOOOl 94AF.f.4. 1 

RTAOOOOOl 18A.I.8.1 

RTAOOOOOl 80AF.H.9. 1 

RTAOOOOOl 78AF.0.23.1 

RTA00000181AF.C.21.1 

RTAOOOOO 1 87AF jl 15. 1 

RTAOOOOO 1 78AF.C.7.1 

RTA00000183AF.e.l.l 

RTAOOOOOl 18A.C.4.1 

RTA00000187AF.rn.23 .2 



I 


463 5 


2 




3 


4677 


4 


3706 

J / I/O 


5 


36S35 




3990 

J77U 


7 


53 TQ 






O 


7673 


10 

1 V 


/JO / 


1 1 

1 I 


/uOj 


17 




1 3 


X/xOU 


\A 




1 C 
I J 




to 


o/ol 


17 




1ft 
1 o 


1 1AAH 


10 


1 *C791 


20 
xv 


34in 


21 


70*55 


27 




23 




X*t 


107 1 O 


25 


IU714 


26 




27 


1 47X6 


28 


17004 


29 




30 


11777 


31 




32 




11 




34 


5832 


35 


7801 


36 


76760 


37 


40132 


38 




39 


4016 


40 


5382 



and clone name 
Clone Name 

M00001429B.A11 

M0000I608D:A11 

M00001686A.-E06 

M00004068B:A01 

M00001449A.G10 

M00001532B:A06 

M00004169C:C12 

M00001417A.E02 

M00001497A:G02 

MO00O1387B.G03 

M00001557A.D02 

M00001680D:F08 

M00O014S0A:D08 

M00001407B:D1 1 

M00004191D:BI 1 

M00001557D:D09 

M00003856B:C02 

MOO0OI676B.F05 

M0000I467A:D08 

M00003981A:E10 

M00001557A:D02 

M00001488B.F12 

M00001531A:H11 

M00004223A.G10 

M00004275C.C1 1 

M00003741D:C09 

M00004283B:A04 

M00001617C:E02 

M00001426B:D12 

M00004138B:H02 

M00005180C:G03 

M0000I450A:A1I 

M00001414A:B01 

MOO001388D:G05 

M00001446A:F05 

M00001657D:F08 

M00001365C:C10 

M00O01505C:C05 

M00001395A:C03 

M00001688C:F09 
SEQ ID NO:  Cluster ID  Sequence Name            Clone Name
41          5693        RTA00000190AF.p.17.2     M00003978B:G05
42          307         RTA00000136A.o.4.2       M00001552A:B12
43          39833       RTA00000178AF.L23.1      M00001378B:B02
44                      RTA00000193AFjn.5.1      M00004359B:G02
45          5325        RTA00000191AF.o.6.1      M00004093D:B12
46          5325        RTA00000191AF.o.6.2      M00004093D:B12
47          18957       RTA00000190AR.m.9.1      M00003958A:H02
48          39508       RTA00000120A.o.2.1       M00001467A:D04
49          22390       RTA00000136A.j.13.1      M00001551A:G06
50          12170       RTA00000125A.h.18.4      M00001544A:E03
51          4393        RTA00000187AF.n.17.1     M00001693C:G01
52          19          RTA00000182AF.b.7.1      M00001463C:B11
53                      RTA00000193AF.c.21.1     M00004249D:F10
54          7899        RTA00000189AF.c.10.1     M00003837D:A01
55          40073       RTA00000191AF.eJ.I       M00004028D:C05
56          7005        RTA00000179AF.o.22.1     M00001410A:D07
57                      RTA00000187AF Ji.22.1    M00001679A:F06
58          18957       RTA00000190AF.m.9.2      M00003958A:H02
59          18957       RTA00000183AF.h.23.1     M00001528A:F09
60          16283       RTA00000182AF.c.22.1     M00001467A:D08
61          6974        RTA00000183AF.d.9.1      M00001504CJI06
62          2623        RTA00000183AF.b.14.1     M00001500A:E11
63          9105        RTA00000191AFJL21.2      M00003983A^\05
64          13238       RTA00000181AFjn.4.1      M00001455A:E09
65          5749        RTA00000185AF.aJ9.1      M00001571CJ106
66          6455        RTA00000193AF.b.9.1      M00004229BdF08
67          23001       RTA00000185AF.c.24.1     M00001578B£04
68          6455        RTA00000192AF.g.3.1      M00004157CA09
69          13595       RTA00000189AFX8.1        M00003851B:D10
70          39442       RTA00000120A.o^1.1       M00001467A:E10
71          17036       RTA00000191AF.f.13.1     M00004035D:B06
72                      RTA00000183AF.g.9.1      M00001513BK303
73          7005        RTA00000181AFJc24.1      M00001454B:C12
74          6268        RTA00000126A.o.23.1      M00001551A:B10
75          16130       RTA00000119A.c.13.1      M00001453A:E11
76          23201       RTA00000187AFa14.1       M00001657D:C03
77          5321        RTA00000183AF.k.8.1      M00001534A:F09
78          13157       RTA00000186AF.a.6.1      M00001614C:F10
79          2102        RTA00000193AFju7.I       M00004377C:F05
80          1058        RTA00000126A.e 2.0 3     M00001548A:H09
81          40392       RTA00000180AF.j.8.1      M00001429D:D07
82                      RTA00000183AF.e^3.1      M00001506DA09
83          11476       RTA00000187AF.p.19.1     M00003747D:C05



WO 99/33982 
SEQ ID NO:  Cluster ID  Sequence Name            Clone Name
729         6539        90.B10.sp6:130880.Seq    M00003844C.BM
730         6874        90.C10.sp6:130892.Seq    M00003846B.O06
731                     90.D10.sp6:130904.Seq    M00003851B:D08
732         13595       90.E10.sp6:130916.Seq    M00003851B:D10
733         5619        90.F10.sp6:130928.Seq    M00003853A:D04
734         10515       90.G10.sp6:130940.Seq    M00003853A:F12
735         4622        90.H10.sp6:130952.Seq    M00003856B:C02
736         3389        90.A11.sp6:130869.Seq    M00003857A:G10
737         4718        90.B11.sp6:130881.Seq    M00003857A:H03
738                     90.C11.sp6:130893.Seq    M00003867A:D10
739         12977       90.F11.sp6:130929.Seq    M00003875B:F04
740         8479        90.G11.sp6:130941.Seq    M00003875C:G07
741                     90.H11.sp6:130953.Seq    M00003875D:D11
742         7798        90.A12.sp6:130870.Seq    M00003876D:E12
743         5345        90.B12.sp6:130882.Seq    M00003879B:C11
744         31587       90.C12.sp6:130894.Seq    M00003879B:D10
745         14507       90.D12.sp6:130906.Seq    M00003879D:A02
746         13576       90.F12.sp6:130930.Seq    M00003885C:A02
747                     90.G12.sp6:130942.Seq    M00003891C:H09
748         9285        90.H12.sp6:130954.Seq    M00003906C:E10
749         39809       99.A1.sp6:131230.Seq     M00003907D:A09
750         16317       99.B1.sp6:131242.Seq     M00003907D:H04
751         8672        99.C1.sp6:131254.Seq     M00003909D:C03
752         12532       99.D1.sp6:131266.Seq     M00003912B:D01
753         3900        99.E1.sp6:131278.Seq     M00003914C:F05
754         23255       99.F1.sp6:131290.Seq     M00003922A:E06
755         24488       99.C2.sp6:131255.Seq     M00003968B:F06
756         40122       99.D2.sp6:131267.Seq     M00003970C:B09
757         23210       99.E2.sp6:131279.Seq     M00003974D:E07
758         23358       99.F2.sp6:131291.Seq     M00003974D:H02
759         3430        99.A3.sp6:131232.Seq     M00003981A:E10
760         2433        99.B3.sp6:131244.Seq     M00003982C:C02
761         9105        99.C3.sp6:131256.Seq     M00003983A^05
762         6124        99.D3.sp6:131268.Seq     M00004028D:A06
763         40073       99.E3.sp6:131280.Seq     M00004028D:C05
764         37285       99.H3.sp6:131316.Seq     M00004035C:A07
765         17036       99.A4.sp6:131233.Seq     M00004035D:B06
766         3706        99.C4.sp6:131257.Seq     M00004068B:A01
767                     99.D4.sp6:131269.Seq     M00004072A:C03
768         15069       99.F4.sp6:131293.Seq     M00004081C:D10
769         9285        99.H4.sp6:131317.Seq     M00004086O:G06
770         6880        99.A5.sp6:131234.Seq     M00004087D:A01
771         5325        99.C5.sp6:131258.Seq     M00004093D:B12
Table 2

Nearest Neighbor (BlastN vs. GenBank) | Nearest Neighbor (BlastX vs. Non-Redundant Proteins)

SEQ ID | ACCESSION | DESCRIPTION | P VALUE | ACCESSION | DESCRIPTION | P VALUE
42 | <NONE> | <NONE> | <NONE> | CEC01G10_5 | Caenorhabditis elegans cosmid C01G10, complete sequence; C01G10.8; CDNA EST CEMSC45R comes from this gene>GP:CEC01G10_5 Caenorhabditis elegans cosmid C01G10; C01G10.8; CDNA EST CEMSC45R comes from this gene | 2.30E-12
43 | <NONE> | <NONE> | <NONE> | HSU15779_1 | Human p70 (ST5) mRNA, alternatively spliced, complete cds; Differentially expressed; alternatively spliced | 9.50E-14
44 | <NONE> | <NONE> | <NONE> | MTCY210_31 | Mycobacterium tuberculosis cosmid Y210; Unknown; MTCY210.31, unknown, len: 299 aa, slight similarity to carboxykinases | 1.70E-17
45 | U61403 | Dictyostelium discoideum PrlA (prlA) mRNA, partial cds. | 1 | U93472_1 | Danio rerio PPARB gene, partial cds; Nuclear receptor C domain | 0.95
46 | Z92832 | Caenorhabditis elegans DNA *** SEQUENCING IN PROGRESS *** from clone F31D4; HTGS phase 1. | 1 | U93472_1 | Danio rerio PPARB gene, partial cds; Nuclear receptor C domain | 0.94
47 | L36557 | Oryza sativa (clone pRG3) repetitive element | 1 | HSU61262_1 | Human neogenin mRNA, complete cds | 0.89
Table 2

Nearest Neighbor (BlastN vs. GenBank) | Nearest Neighbor (BlastX vs. Non-Redundant Proteins)

SEQ ID | ACCESSION | DESCRIPTION | P VALUE | ACCESSION | DESCRIPTION | P VALUE
766 | AF017357 | Oryza sativa low molecular early light-inducible protein mRNA, complete cds | 0.38 | RGS3_HUMAN | REGULATOR OF G-PROTEIN SIGNALLING 3 (RGS3) (RGP3) | 0.23
767 | U67599 | Methanococcus jannaschii section 141 of 150 of the complete genome | 0.13 | <NONE> | <NONE> | <NONE>
768 | X74178 | B.taurus microsatellite DNA INRA153 | 0.13 | FAG1_SYNY3 | P73574 synechocystis sp. (strain pcc 6803). 3-oxoacyl-[acyl-carrier protein] reductase 1 (ec 1.1.1.100) (3-ketoacyl-acyl carrier protein reductase 1). 11/98 | 5.00E-16
769 | AF041858 | Mus musculus synaptojanin 2 isoform delta mRNA, partial cds | 0.043 | CA44_HUMAN | COLLAGEN ALPHA 4(IV) CHAIN PRECURSOR | 0.24
770 | J01404 | Drosophila melanogaster mitochondrial cytochrome c oxidase subunits, ATPase6, 7 tRNAs (Trp, Cys, Tyr, Leu(UUR), Lys, Asp, Gly) genes, and unidentified reading frames A6L, 2 and 3. | 0.021 | NU1M_CITLA | NADH-UBIQUINONE OXIDOREDUCTASE CHAIN 1 (EC 1.6.5.3) | 7.2
771 | AL022317 | Human DNA sequence from clone 140L1 on chromosome 22q13.1-I3Jl, complete sequence [Homo sapiens] | 3.00E-41 | ALU7_HUMAN | !!!! ALU SUBFAMILY SQ WARNING ENTRY !!!! | 4.00E-08
772 | U95094 | Xenopus laevis XL-INCENP (XL-INCENP) mRNA, complete cds | 1.00E-09 | <NONE> | <NONE> | <NONE>
1. A library of polynucleotides, the library comprising the sequence information of
at least one of SEQ ID NOS: 1-844.

2. The library of claim 1, wherein the library is provided on a nucleic acid array. 

3. The library of claim 1, wherein the library is provided in a computer-readable
format.

4. The library of claim 1, wherein the library comprises a differentially expressed
polynucleotide comprising a sequence selected from the group consisting of SEQ ID NOS: 9,
39, 42, 52, 62, 74, 119, 172, 317, and 379.

5. The library of claim 1, wherein the library comprises a polynucleotide 
differentially expressed in a human breast cancer cell, where the polynucleotide comprises a 
sequence selected from the group consisting of SEQ ID NOS: 4, 9, 39, 42, 52, 62, 65, 66, 68, 
74, 81, 114, 123, 144, 130, 157, 162, 172, 178, 183, 202, 214, 219, 223, 258, 298, 317, 338, 
379, 384, 386, and 388.

6. The library of claim 1, wherein the library comprises a polynucleotide 
differentially expressed in a human colon cancer cell, where the polynucleotide comprises a 
sequence selected from the group consisting of SEQ ID NOS: 1, 39, 52, 97, 119, 134, 172,
176, 241, 288, 317, 357, 362, and 374. 

7. The library of claim 1, wherein the library comprises a polynucleotide 
differentially expressed in a human lung cancer cell, where the polynucleotide comprises a 
sequence selected from the group consisting of SEQ ID NOS: 9, 34, 42, 62, 74, 106, 119, 
135, 154, 160, 260, 308, 323, 349, 361, 369, 371, 379, 395, 381, and 400. 



8. An isolated polynucleotide comprising a nucleotide sequence having at least 90%
sequence identity to an identifying sequence of SEQ ID NOS: 1-844 or a degenerate variant
thereof.



9. An isolated polynucleotide according to claim 8, wherein the polynucleotide
comprises a sequence encoding a polypeptide of a protein family selected from the group
consisting of: 4 transmembrane segments integral membrane proteins, 7 transmembrane
receptors, ATPases associated with various cellular activities (AAA), eukaryotic aspartyl
proteases, GATA family of transcription factors, G-protein alpha subunit, phorbol
esters/diacylglycerol binding proteins, protein kinase, protein phosphatase 2C, protein
tyrosine phosphatase, trypsin, wnt family of developmental signaling proteins, and
WW/rsp5/WWP domain containing proteins.

10. The polynucleotide of claim 9, wherein the polynucleotide comprises a
sequence of one of SEQ ID NOS: 24, 41, 101, 157, 291, 305, 315, 341, 63, 116, 134, 136,
151, 384, 404, 308, 213, 367, 188, 251, 202, 315, 367, 397, 256, 382, 169, 23, 291, 324, 330,
341, 353, 188, 379, and 395.

11. The polynucleotide of claim 8, wherein the polynucleotide comprises a
sequence encoding a polypeptide having a functional domain selected from the group
consisting of: Ank repeat, basic region plus leucine zipper transcription factors,
bromodomain, EF-hand, SH3 domain, WD domain/G-beta repeats, zinc finger (C2H2 type),
zinc finger (CCHC class), and zinc-binding metalloprotease domain.

12. The polynucleotide of claim 11, wherein the polynucleotide comprises a
sequence of one of SEQ ID NOS: 116, 251, 374, 97, 136, 242, 379, 306, 386, 18, 335, 61,
306, 386, 322, 306, and 395.


13. A recombinant host cell containing the polynucleotide of claim 8. 

14. An isolated polypeptide encoded by the polynucleotide of claim 8. 

15. An antibody that specifically binds a polypeptide of claim 14. 

16. A vector comprising the polynucleotide of claim 8. 
17. A polynucleotide comprising the nucleotide sequence of an insert contained in a
clone deposited as ATCC accession number xx, xx, xx, xx, xx, xx, xx, xx, or xx. 



18. A method of detecting differentially expressed genes correlated with a cancerous
state of a mammalian cell, the method comprising the step of:

detecting at least one differentially expressed gene product in a test sample derived
from a cell suspected of being cancerous, where the gene product is encoded by a gene
corresponding to a sequence of at least one of SEQ ID NOS: 4, 9, 39, 42, 52, 62, 65, 66, 68,
74, 81, 114, 123, 144, 130, 157, 162, 172, 178, 183, 202, 214, 219, 223, 258, 298, 317, 338,
379, 384, 386, 388, 1, 39, 52, 97, 119, 134, 172, 176, 241, 288, 317, 357, 362, 374, 9, 34, 42,
62, 74, 106, 119, 135, 154, 160, 260, 308, 323, 349, 361, 369, 371, 379, 395, 381, and 400;

wherein detection of the differentially expressed gene product is correlated with a
cancerous state of the cell from which the test sample was derived.

19. The method of claim 18, wherein said detecting step is by hybridization of the
test sample to a reference array, wherein the reference array comprises an identifying
sequence of at least one of SEQ ID NOS: 1-844.



20. The method of claim 18, wherein the cell is a breast tissue derived cell, and the
differentially expressed gene product is encoded by a gene corresponding to a sequence of at
least one of SEQ ID NOS: 4, 9, 39, 42, 52, 62, 65, 66, 68, 74, 81, 114, 123, 144, 130, 157,
162, 172, 178, 183, 202, 214, 219, 223, 258, 298, 317, 338, 379, 384, 386, and 388.

21. The method of claim 18, wherein the cell is a colon tissue derived cell, and the
differentially expressed gene product is encoded by a gene corresponding to a sequence of at
least one of SEQ ID NOS: 1, 39, 52, 97, 119, 134, 172, 176, 241, 288, 317, 357, 362, and
374.

22. The method of claim 18, wherein the cell is a lung tissue derived cell, and the
differentially expressed gene product is encoded by a gene corresponding to a sequence of at
least one of SEQ ID NOS: 9, 34, 42, 62, 74, 106, 119, 135, 154, 160, 260, 308, 323, 349,
361, 369, 371, 379, 395, 381, and 400.
<210> 44

<211> 300 

<212> DNA 

<213> Homo sapiens 



<400> 44 

ggcttataca acatagtggg gaacgcatgg gaatggactt cagactggtg gactgttcat 60 

cattctgttg aagaaacgct taacccaaaa ggtccccctt ctgggaaaga ccgagtgaag 120 

aaaggtggat cctacatgtg ccataggtct tattgttaca ggtatcgctg tgctgctcgg 180 

agccagaaca cacctgatag ctctgcttcg aatctgggat tccgctgtgc agccgaccgg 240 

ctgcccacta tggactgaca accaaggaaa gtcttcccca gtccaaggag cagccgtgtc 300 



<210> 45 

<211> 300 

<212> DNA 

<213> Homo sapiens



<400> 45 

gtggaagaaa attttttgct gcttctggtt cccagaaaag ggagccattt taacagacac 60 

atctgtcaaa agaaatgact tgtcgattat ttctggctaa tttttcttta tagcagagtt 120 

tctcacacct ggcgagctgt ggcatgcttt taaacagagt tcatttccag taccctccat 180 

cagtgcaccc tgctttaaga aaatgaactt atgcaaatag acatccacag cgtcggtaaa 240 

ttaaggggtg atcaccaagt ttcataatat tttcccttta taaaaggatt tgttggccag 300 

<210> 46 

<211> 300 

<212> DNA 

<213> Homo sapiens 



<400> 46 

gtggaagaaa attttttgct gcttctggtt cccagaaaag ggagccattt tangngacac 60 

atctgtcaaa agaaatgact tgtcgattat ttctggctaa tttttcttta tagcagagtt 120

tctcacacct ggcgagctgt ggcatgcttt taaacagagt tcatttccag taccctccat 180 

cagtgcaccc tgctttaaga aaatgaactt atgcaaatag acatccacag cgtcggtaaa 240 

ttaaggggtg atcaccaagt ttcataatat tttcccttta taaaaggatt tgttggccag 300 

<210> 47 

<211> 300 

<212> DNA 

<213> Homo sapiens 

<400> 47 

acacagataa ttttaataca atgtgaaaaa gtgtatgggt gtgtagaaga ggggttctta 60 

gagtttctgg agagaatgat tctgagctcg gttttgacaa aagaggagct gctgaggcta 120 

aaagtggatg aaaagggcct tataattaaa agaaacaaga caggactcag aggtgtgaaa 180 

caaatattat gcatggtgaa ttacaatgag ttgggggtat tctgtagccc taaagtacaa 240 

ggtataaaga gacagaaaat gatcctggaa tatagacaga ggatacttca tctctcatga 300

<210> 48 
<211> 300 
<212> DNA 
<400> 770



cccttttttt tttttcccnn 
ccnaaaaaaa aattgggncc 
tttttggggn nnnnnaaann 
gggnnnannc cnccccccaa 
cagggnnnag aataaggccc 
caaaggnaaa tccnttggga 
aaaaatgggg aaangnaaaa 
tttcattggg ggtncntggg 
cctttaattg ccattaagca 
cttcganagn gaaaatcaac 
gaagctagaa cattagaagc 
aactgccaaa gagtgttgtt 
ngaaaaaaac 1 1 1 a t ang 1 1 
ttctnccaag agatatcctt 
aataaatnnc ttcnagaagc 
agcacg t 1 c t t gacnacaga 
aaggtgacta ttnaanagct 
ttcaanncca tttttangat 
gtttaaatta onaaanaanc 
ttttnnncnt ntnaann 



aaaaaaanat tggggncccn tttttttggg nttttttttc 
ctttttgggg ggnntnaaaa aaannnnnnn nccccccntt 
r.nnnnnncnn nttnnnnnnn nnnnnnnnnn ggnnttnnng 
tttcccggnn attnttccgg gcccaatttt tgggaccccc 
ggggnttttt tttncnaggg ncccaaaagg gcccttgggc 
aattttggga atttggccct tggnanntcc caataccggn 
aaggnttncn ccaaattggt tggggggggg ttccaaagat 
ctttcaaccc naaggnaang ggtttncttt caaaaaatta 
attcccaang gttannaaag ggtgtttntt ctcanctatg 
naatggaaaa tgtgttgtaa ttggtctgca ntctacanga 
tttggaanag ggcggnggag aattgaatga tntttgnttc 
gcagtcactc atttgaaaaa ctattttcct gctccagaca 
tactaggaat cgatttgaca agcnttcang taacaaacag 
gttnaagaan nattanaata ncnngaaagc ggaaanngtg 
ccaaaaaxinc acngaanaag tatggtgggn cttactggtt 
tggaaattga axitctngatt ncctctgatt antgaatgaa 
cttnanatac catgagtntt tggancattg attgaccaat 
ngaattntta tnaatgattn attnanaant gannnccttn 
cntcaaaana cnanagggga tttataaaat ctaataanan 



1020 
1080 
1140 
1157 



60 
120 
180 
240 
300 
360 
420 
480 
540 
600 
660 
720 
780 
640 
900 
960 



<210> 771 
<211> 760 
<212> DNA 
<213> Homo sapiens 

<400> 771 

ngncctttna tnccttntga anccntttgn aattnctcnn nnngttgatc ccatcgattc 60 

gaattcggca cgaggtggaa gaaaattttt tgctgcttct ggttnccaga aaagggagcc 120 

attttaacag acacatctgt caaaagaaat gacttgtcga ttatttctgg ctaatttttc 180 

tttatagcag agtttctcac acctggcgag ctgtggcatg cttttaaaca gagttcattt 240 

ccagtaccct ccatcagtgc accctgcttt aagaaaatga acttatgcaa atagacatcc 300 

acagcgtcgg taaattaagg ggtgatcacc aagtttcata atattttccc tttataaaag 360 

gatttgttgg ccaggtgcag tggttcatgc ctgtaatccc agcagtttgg gaggctgagg 420 

tgggtggatc acctgaggtc aggagttcga gaccaacctg accaacatgg tgagaccccc 480

gtctctacta aaaataaaaa aaaaattagc tgggagtggn ggtgggcacc tgtaatccta 540 

gctacttggg aggctgaacc aggagaatct cttgaacctg ggaggcanag gttgcaagtg 600 

agcccgagat cgtgccattg cactccaacc agggcaacaa gagtgaaact ccatcttaaa 660 

aaanaaaaan gaaaactcga gcctctagaa ctatagtgag tcgtattacg tagatccaga 720 

catgataaga tacattgatg aattttggac aaaccccann 760 

<210> 772 
<211> 777 
<212> DNA
<213> Homo sapiens 

<400> 772 

gaaancccat ttnnnnnttc cncttcnaat cccttgghta ctcgntcttt ntgcaggatc 60

ccatcgattc gaattcggca cgagctctac taaaaataca aaaattagct gggcgtggtg 120

gcacacacct gtaatcccag ttacttggga ggctgaggca caagaatcgc ttgaacccgg 180 